As AI continues to evolve and become more accessible, one question persists: how do we know if AI is actually good, especially in education? Whether you are a researcher or developer fine-tuning ChatGPT, building an educational tool, or testing model performance, evaluation is critical for measuring progress and ensuring accuracy and functionality.
Evaluation methods can be grouped into three categories: (1) comparing large language model (LLM) outputs to ground truth answers, such as checking whether a model solves a math problem correctly; (2) humans rating LLM outputs, commonly used in reinforcement learning from human feedback (RLHF), where models learn from human preferences; and (3) rubric-based grading and relative preferences, where responses are assessed using predefined criteria or compared against each other. This last method is especially useful when no single “correct” answer exists or when there are no strong, pre-specified desired characteristics.
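To make the first category concrete, here is a minimal sketch of ground-truth scoring; the answers, normalization rules, and function names are illustrative assumptions rather than part of any standard benchmark.

```python
# Minimal sketch of ground-truth evaluation (category 1): compare model
# answers to known correct answers. The items and normalization rules
# here are hypothetical placeholders.

def normalize(answer: str) -> str:
    """Lowercase and strip whitespace so '42' and ' 42 ' compare equal."""
    return answer.strip().lower()

def exact_match_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of items where the model's answer matches the reference."""
    correct = sum(
        normalize(p) == normalize(g) for p, g in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)

# Example: three math items, two answered correctly.
preds = ["12", " 7", "x = 5"]
truth = ["12", "7", "x = 4"]
print(f"Exact-match accuracy: {exact_match_accuracy(preds, truth):.2f}")  # 0.67
```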
Evaluating AI performance can be complex. It can be difficult to find relevant datasets that align with academic standards or real-world tasks. Without standardized or easily accessible benchmarks, comparing or tracking performance meaningfully becomes a challenge.
There are also privacy and safety concerns. Models used in education must be free of bias and be safe for diverse learners. In addition, student data is often sensitive, making it difficult to evaluate tools without compromising privacy. Balancing transparency, fairness, and privacy can be an obstacle.
Strategies To Make Evaluating AI Better
Despite these limitations, there are several avenues to making AI evaluation more reliable, explainable, and accessible.
Datasets Repository. One critical need is the creation of a centralized educational evaluation repository. Currently, the education and learning engineering fields lack a public, standardized resource where developers and researchers can access task-aligned datasets and rubrics that support academic standards, such as Common Core. Existing efforts like Carnegie Mellon’s DataShop or our own website, the Learning Exchange, provide some support, but they are often limited in scope, outdated, or not designed to support emerging use cases.
A centralized repository would streamline LLM evaluation by putting relevant, task-aligned data in one place, making evaluations more reliable and transparent. It would eliminate time-consuming searches for high-quality datasets, which can take weeks. Such a repository could host a wide variety of tasks, ranging from essay feedback and segmentation to reading comprehension and gamified learning, and include standardized scoring tools and instructions. For example, a public dataset of student essays expertly labeled with essay components could be used to fine-tune a writing feedback model. Similarly, a dataset of math questions annotated with common misconceptions could support fine-tuning of math feedback or learning experience models. With such data readily available, evaluations could be more standardized, reducing variability and enabling more precise comparisons across models. By offering shared formats and APIs, a repository could also support easy integration across platforms and research studies, fostering collaboration and innovation across the field. This could be further enhanced by integrating LLM calling or by hosting programming notebooks, similar to Kaggle.
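To illustrate what a shared format could look like, here is a hypothetical record sketch; the schema, field names, and Common Core tag are assumptions for illustration, not an existing specification.

```python
# Illustrative sketch of what a standardized repository record might look like.
# The schema and field names are hypothetical, not an existing standard.
import json

record = {
    "task": "essay_feedback",
    "standard": "CCSS.ELA-LITERACY.W.8.1",   # e.g., a Common Core writing standard
    "input": "Student essay text goes here...",
    "labels": {
        "claim": [0, 1],        # sentence indices containing the claim
        "evidence": [2, 3, 4],  # sentence indices containing supporting evidence
    },
    "rubric": {
        "claim_present": "1 point if a clear claim is stated",
        "evidence_quality": "0-2 points based on relevance of cited evidence",
    },
}

# A shared JSON format like this would let any platform load, score, and
# compare models against the same tasks.
print(json.dumps(record, indent=2))
```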
Leaderboard. Another essential improvement for evaluating AI would be the development of performance leaderboards dedicated to LLM evaluation in education. These leaderboards would help edtech developers, schools, and other educational organizations identify which models meet acceptable standards by evaluating and comparing LLM performance on educational tasks, with a focus on cost-effectiveness. A regularly updated leaderboard could offer transparency around how different models perform across specific tasks and contexts, helping teams choose the right AI model for their needs. Building such a leaderboard would require broad collaboration between edtech platforms and academic researchers to share relevant datasets and co-develop benchmarks that reflect real-world educational challenges.
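As a rough sketch of how such a leaderboard might weigh quality against cost, consider the toy ranking below; the model names, scores, and prices are entirely made up.

```python
# Hypothetical sketch of ranking models by quality per dollar for a leaderboard.
# Scores and prices below are invented illustrations, not real benchmark results.

models = [
    {"name": "model-a", "rubric_score": 0.86, "usd_per_1k_items": 4.00},
    {"name": "model-b", "rubric_score": 0.81, "usd_per_1k_items": 0.50},
    {"name": "model-c", "rubric_score": 0.74, "usd_per_1k_items": 0.10},
]

# One simple cost-effectiveness view: rubric points per dollar spent.
for m in models:
    m["score_per_dollar"] = m["rubric_score"] / m["usd_per_1k_items"]

ranked = sorted(models, key=lambda m: m["score_per_dollar"], reverse=True)
for rank, m in enumerate(ranked, start=1):
    print(rank, m["name"],
          f"score={m['rubric_score']:.2f}",
          f"score per $ (per 1k items)={m['score_per_dollar']:.2f}")
```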
Our team at the Learning Agency is starting to build such a leaderboard. Please reach out if you have ideas or want to partner.
LLM-as-a-Judge. Using LLMs to evaluate other LLMs, known as “LLM-as-a-judge,” is an emerging alternative to human evaluation that has become increasingly attractive for its efficiency, cost-effectiveness, and promise.
When supported by clean, well-labeled data and high-quality evaluation rubrics, LLMs can assist in producing consistent, scalable evaluations. As L. Burleigh of The Learning Agency wrote, this can be particularly valuable in the early stages of tool development, where it can meaningfully speed up workflows and enable the rapid iteration that is critical for fine-tuning. However, this emerging role for language models is not without limitations.
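For concreteness, a rubric-based judging call might look something like the sketch below, here using the OpenAI Python client as one possible judge; the rubric wording, prompt, and judge model are placeholder assumptions, not a recommended configuration.

```python
# Minimal LLM-as-a-judge sketch. The rubric, prompt wording, and choice of
# judge model are illustrative placeholders; any capable chat model could be
# substituted. Requires the `openai` package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the tutoring response from 1 (poor) to 5 (excellent) on:
- Accuracy: is the explanation mathematically correct?
- Clarity: would a middle-school student follow it?
Return only a JSON object like {"accuracy": 4, "clarity": 3}."""

def judge(student_question: str, model_response: str) -> str:
    """Ask a judge model to grade one response against the rubric."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder judge model
        temperature=0,         # keep grading as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {student_question}\n"
                                        f"Response to grade: {model_response}"},
        ],
    )
    return completion.choices[0].message.content

# Example usage:
# print(judge("What is 3/4 + 1/8?", "Convert 3/4 to 6/8, then 6/8 + 1/8 = 7/8."))
```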
To improve access and reduce costs, open source evaluation tools also play a role. One example of this in progress is FlexEval, a tool under development by Thomas Christie at the Digital Harbor Foundation. FlexEval allows users to work with custom rubrics and datasets locally, generating performance scores that can be analyzed for consistency and reliability while preserving data privacy. Tools like FlexEval point toward a more scalable future in which evaluation processes can be easily adapted across a range of learning contexts. However, the field needs more flexible, open-access tools that can handle a broad spectrum of educational data types while offering transparency into how scores are generated.
PII Anonymization Tools. A final strategy to improve AI evaluation is the use of anonymization tools. Educational data frequently includes sensitive personal information, especially when it involves minors. Using LLMs without proper safeguards introduces the risk of exposing personally identifiable information (PII). At present, manual data review is the most reliable method of removing PII, but it is labor-intensive and slows down research. Automated systems, such as those based on named entity recognition (NER), can assist by identifying structured data like names, phone numbers, or email addresses. However, they often struggle to distinguish between genuinely sensitive information and innocuous references, such as the name of a published author in an essay.
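A minimal sketch of the NER approach, using spaCy’s small English model plus a simple email pattern, shows both its usefulness and the limitation just described; the sample essay and label choices are illustrative assumptions.

```python
# Minimal NER-based PII flagging sketch using spaCy's small English model
# (pip install spacy && python -m spacy download en_core_web_sm).
# It illustrates the limitation described above: every PERSON entity gets
# flagged, including a cited author's name that isn't actually sensitive.
import re
import spacy

nlp = spacy.load("en_core_web_sm")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def flag_pii(text: str) -> list[tuple[str, str]]:
    """Return (span, label) pairs that a human reviewer should inspect."""
    doc = nlp(text)
    hits = [(ent.text, ent.label_) for ent in doc.ents
            if ent.label_ in {"PERSON", "GPE", "ORG"}]
    hits += [(m.group(), "EMAIL") for m in EMAIL_RE.finditer(text)]
    return hits

essay = ("My name is Jordan Lee and my email is jlee@example.com. "
         "As Maya Angelou wrote, 'There is no greater agony than bearing an untold story.'")
print(flag_pii(essay))
# Typically flags both 'Jordan Lee' (real PII) and 'Maya Angelou' (a cited author).
```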
To solve this, the field urgently needs an open source tool capable of identifying and anonymizing sensitive information across a wide range of contexts. Scott Crossley’s team at Vanderbilt University has been looking into creating such a tool that would use artificial surrogates to replace private data while preserving the structure and educational relevance of the content.
Meanwhile, Professor Ryan Baker at the University of Pennsylvania, along with other researchers, has explored using large language models such as GPT-4 to redact PII. Their research found that GPT-4 achieved a high recall of 0.958 for identifying PII but demonstrated lower precision, redacting text that was not actually sensitive, which highlights an area for improvement. Despite this, such tools would allow for safer dataset sharing and evaluation without sacrificing user privacy or introducing ethical risks.
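For readers less familiar with these metrics, the toy calculation below shows how recall and precision describe a redaction system; the counts are invented for illustration and are not taken from the cited study.

```python
# Toy illustration of the recall/precision trade-off described above.
# All counts are made up; they are not from the cited research.
true_pii_spans = 100     # spans in the data that genuinely are PII
caught = 96              # true PII spans the system actually redacted
total_redactions = 130   # everything the system redacted, PII or not

recall = caught / true_pii_spans        # did we catch the sensitive spans?
precision = caught / total_redactions   # were our redactions actually PII?

print(f"recall={recall:.3f}, precision={precision:.3f}")
# High recall with lower precision means few leaks but extra, unnecessary redactions.
```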
Conclusion
Evaluating AI-powered systems is not just about measuring output; it’s about ensuring that AI tools are equitable, effective, and trustworthy in the hands of educators and students. By investing in everything from centralized datasets and standardized rubrics to performance leaderboards and anonymization tools, the field can build an ecosystem where AI meaningfully enhances learning without compromising safety or integrity.

Kennedy Smith
Program Associate