Educational AI Leaderboard
Understanding Benchmarks and Leaderboards
What Are Benchmarks?
Benchmarks are standardized tests for AI models. Just as students take the SAT to demonstrate college readiness, AI models are evaluated on benchmark datasets to measure their capabilities on specific tasks – like grading essays, answering math questions, or understanding language.
These standardized evaluations are valuable because they provide apples-to-apples comparisons across different models. When a company releases a new AI model, it typically reports the model’s performance on popular benchmarks, so researchers and practitioners can quickly see how the new model compares to existing options. In this way, benchmarks don’t just measure progress – they help steer it. Achieving state-of-the-art results on well-known benchmarks has become a key marker of technological advancement.
What Are Leaderboards?
Leaderboards collect model performance across multiple benchmarks in one place, making it easier to identify the right tool for your needs. Some leaderboards, like LiveBench or LMArena, evaluate capabilities across reasoning, coding, math, and instruction-following. Others focus on specific domains – for example, MathArena concentrates on particularly challenging mathematical problems.
The Learning Agency Leaderboard
Why Education Needs Its Own Leaderboard
The Learning Agency Leaderboard focuses specifically on educational applications. General-purpose benchmarks don’t always capture what matters most in education—like providing accurate, constructive feedback on student work or identifying specific misconceptions that require instructional support. Our leaderboard addresses this gap by featuring benchmarks designed around real educational tasks, with private test sets to ensure models are truly generalizing rather than memorizing training data.
This focus serves two audiences:
For practitioners – teachers, administrators, ed-tech developers – the leaderboard provides performance data on tasks directly relevant to your work: automated essay scoring, math grading, and misconception detection. We also include cost comparisons, recognizing that the “best” model isn’t always the most practical choice for schools and organizations operating under budget constraints.
For researchers, the leaderboard increases visibility for newly developed educational benchmarks. As established benchmarks saturate and lose discriminative power, fresh evaluation datasets help steer AI development toward new challenges. By providing a dedicated platform for these benchmarks, we help them gain attention from the research community and, in turn, influence future model development toward educationally relevant capabilities.
What You’ll Find Here
This page presents the initial iteration of our educational AI leaderboard, featuring two domain-specific benchmarks with private test sets:
- ASAP 2.0 – Automated scoring of student persuasive essays
- Eedi Misconception Annotation Project – Identifying student math misconceptions
Each benchmark includes detailed methodology, performance results across major AI models, and practical guidance for implementation in educational settings.
How to Get Involved
We are actively seeking new benchmarks to expand this leaderboard: datasets built around educationally meaningful tasks (assessment, feedback, tutoring, content generation, etc.), with validated annotations from domain experts, evaluation criteria that can be computed automatically, and private test sets to prevent data contamination.
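To make “evaluation criteria that can be computed automatically” concrete, here is a minimal sketch of how an essay-scoring submission might be checked against expert labels. The CSV layout (essay_id, score) and the choice of quadratic weighted kappa are illustrative assumptions; each benchmark’s actual metric and file format are described in its detailed methodology.

```python
# Minimal sketch of an automatically computable evaluation criterion.
# Assumes a hypothetical CSV layout (essay_id, score) for both the expert
# labels and a model's predictions; quadratic weighted kappa is one common
# agreement metric for essay scoring, used here only as an illustration.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def evaluate_submission(labels_path: str, predictions_path: str) -> float:
    """Return quadratic weighted kappa between expert and model scores."""
    labels = pd.read_csv(labels_path)       # columns: essay_id, score
    preds = pd.read_csv(predictions_path)   # columns: essay_id, score
    merged = labels.merge(preds, on="essay_id", suffixes=("_true", "_pred"))
    return cohen_kappa_score(
        merged["score_true"], merged["score_pred"], weights="quadratic"
    )

if __name__ == "__main__":
    print(evaluate_submission("private_test_labels.csv", "model_predictions.csv"))
```

Because the metric is a single function of predictions and labels, it can be rerun automatically whenever a new model is added, with no manual grading in the loop.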
For organizations with datasets, our team can help handle the technical infrastructure – setting up automated evaluations, running benchmarks across leading models, and maintaining regular updates as new models are released. Through this collaboration, dataset creators gain visibility and regularly updated performance metrics across models, practitioners get access to directly relevant benchmarks for their educational applications, and the broader research community benefits from a centralized resource that helps identify gaps and opportunities in educational AI development.
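As a rough illustration of what that infrastructure involves, the sketch below runs a set of models over the same held-out items and applies one scoring function to each. The model callables and item fields are hypothetical placeholders, not the leaderboard’s actual pipeline.

```python
# Hypothetical sketch of a benchmark harness: every model answers the same
# private test items and is scored with the same metric. Model callables and
# item fields are placeholders, not the leaderboard's real interfaces.
from typing import Callable

def run_benchmark(
    test_items: list[dict],                          # each: {"prompt": ..., "label": ...}
    models: dict[str, Callable[[str], str]],         # model name -> prompt-to-answer callable
    score: Callable[[list[str], list[str]], float],  # (labels, predictions) -> metric
) -> dict[str, float]:
    """Return one metric value per model, computed on identical inputs."""
    labels = [item["label"] for item in test_items]
    results = {}
    for name, ask in models.items():
        predictions = [ask(item["prompt"]) for item in test_items]
        results[name] = score(labels, predictions)
    return results
```

Keeping the prompts and labels private while publishing only the aggregate scores is what prevents the data contamination mentioned above.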
Contact Joon Suh Choi to learn how to integrate your benchmark while maintaining the integrity of private test data.