The Learning Agency

New Educational Datasets For AI Research

The Cutting Ed
  • September 12, 2025
Jules King, Kennedy Smith

Ongoing disruptions within the U.S. Department of Education have created a major roadblock for the future of learning: access to critical education data. 

While many federal datasets have never been easily accessible, recent staffing cuts and reductions in research capacity have made it even harder for researchers to obtain the information about the current state of education that is needed to drive interventions and shape effective policy.

There are even growing concerns about the future of the National Assessment of Educational Progress (NAEP), given the many cutbacks at the Department.

Without good data, researchers are unable to evaluate new teaching methods, analyze student outcomes, or even train AI models that could personalize and enhance learning experiences. These consequences threaten to stall educational innovation and leave students, educators, and ed tech solutions without the tools they need to move forward.

But there is a path ahead, at least for datasets. A growing ecosystem of alternative datasets is emerging, giving researchers new ways to conduct meaningful studies and advance data-driven educational tools without relying on government data.

These new datasets will not be as comprehensive as federal datasets, especially NAEP. But they do give the education community an opportunity to continue to innovate, even in the face of federal setbacks.

Below are eight datasets that provide valuable insights into student learning, instructional practices, and personalized education, each with real-world applications for research and practice.

Student Open-Ended Responses to Math Videos by Eedi

This dataset comes from an online experiment conducted in 2023 involving 5,423 UK students aged 11 to 16. Students engaged with 516 unique quizzes across 15,022 sessions. The experiment evaluated the effectiveness of a free-text answer box used after teaching videos, comparing four instructional approaches: a baseline group and three groups that introduced varying levels of prompts and feedback to improve engagement. The dataset is intended to help researchers better understand how different interactive strategies affect learning outcomes in online educational environments. Researchers can request access to the dataset here.

Use Cases:

  • Studying how different types of feedback affect student knowledge acquisition
  • Training NLP models to assess student responses
  • Designing better user experiences in online learning environments

MRBench by Kaushal Kumar Maurya et al.

This dataset compiles 192 math tutoring conversations with 1,596 tutor turns, or responses, covering various student mistakes. The tutor responses include both human and large language model responses, each labeled by experts across multiple pedagogical dimensions, providing a gold standard for evaluation. The dataset can be accessed here.

Use Cases:

  • Benchmarking AI tutors against human responses
  • Studying error-specific tutoring strategies
  • Developing intelligent tutoring systems with pedagogical grounding

NeurIPS 2022 by Eedi

This time-series dataset builds on the NeurIPS 2020 Education Challenge and explores causal relationships in student learning. It captures how one construct might influence performance and understanding in another. The dataset can be downloaded here.

Use Cases:

  • Modeling causal relationships in educational pathways
  • Designing interventions to boost long-term learning outcomes
  • Improving adaptive learning systems with predictive insights

Longitudinal Math Dataset by ASSISTments

The dataset contains information on over 200 students, tracking them from middle school through to college while capturing behavioral markers related to student engagement. It includes contextual data such as grade level, gender, school-level poverty, and school type (rural vs. urban). The data can be requested here.

Use Cases:

  • Building early warning systems for student disengagement in STEM
  • Long-term impact studies of middle school interventions
  • Predictive modeling for college readiness and career pathways

Identifying Effective Teacher Moves

This dataset contains 567 transcripts, comprising 174,186 annotated teacher utterances and 59,874 student utterances. Each transcript encompasses an entire math lesson, typically 55 minutes long. The transcripts were human-generated from classroom audio and video recordings. Additionally, the transcripts were annotated for 10 teacher talk moves (e.g., restating, pressing for accuracy, revoicing, and making a claim). Where possible, the dataset includes descriptive information such as the recording date, teacher gender, grade level, and original language. The dataset can be accessed here.

Use Cases:

  • Training machine learning models to analyze teaching practices
  • Creating automated coaching tools for teacher professional development
  • Evaluating instructional quality across different classroom settings
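
As a rough illustration of how annotated utterances like these might be summarized before training a model, the sketch below tallies teacher talk-move labels per transcript. The field names (transcript_id, speaker, talk_move) and the sample rows are hypothetical and are not taken from the dataset's actual schema.

```python
from collections import Counter, defaultdict

# Hypothetical rows for illustration only; the real dataset's
# column names and label strings may differ.
utterances = [
    {"transcript_id": "t1", "speaker": "teacher", "talk_move": "revoicing"},
    {"transcript_id": "t1", "speaker": "teacher", "talk_move": "restating"},
    {"transcript_id": "t1", "speaker": "student", "talk_move": None},
    {"transcript_id": "t2", "speaker": "teacher", "talk_move": "revoicing"},
]


def talk_move_counts(rows):
    """Count teacher talk-move labels separately for each transcript."""
    counts = defaultdict(Counter)
    for row in rows:
        # Skip student turns and teacher turns that carry no label.
        if row["speaker"] == "teacher" and row["talk_move"]:
            counts[row["transcript_id"]][row["talk_move"]] += 1
    return counts


counts = talk_move_counts(utterances)
for transcript_id, moves in sorted(counts.items()):
    print(transcript_id, dict(moves))
```

Per-transcript counts of this kind could feed a simple comparison of instructional practice across classrooms, though any real analysis would need to follow the dataset's own documentation.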

Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements

This dataset includes over 14,000 essays from students in grades 6–12, labeled for argumentative and rhetorical elements in response to included reading passages. It contains effectiveness ratings for these discourse elements and holistic quality scores for each essay.

The dataset includes writing from students with acute literacy-related needs (e.g., autism spectrum disorder, dual-sensory impaired, intellectual disability, speech or language impaired) and provides student demographic information such as grade level, race/ethnicity, economic background, and primary exceptionality. The dataset can be downloaded here.

Use Cases:

  • Training models to identify students needing targeted literacy support
  • Creating tools for analyzing discourse element effectiveness in student writing
  • Informing instructional planning based on student group needs

CommonLit Ease of Readability Corpus

This dataset includes approximately 5,000 text excerpts for readers in grades 3–12, labeled for AI-automated reading complexity. Each excerpt includes metadata such as year of publication, genre, and teacher-rated difficulty levels. The texts span over 250 years across two genres, with unique readability criteria informed by ratings from 1,116 educators, primarily native English speakers, with an average age of 40.87. Most raters were female, and the majority held graduate-level education or a bachelor’s degree.

The dataset supports the development of tools to generate or select texts tailored to specific student reading levels, assisting educators in lesson planning and instructional material creation. It enables better alignment of text difficulty with student needs by modeling and refining readability metrics. The dataset can be downloaded here.

Use Cases:

  • Developing and testing text readability models
  • Creating tools for generating leveled instructional texts
  • Supporting lesson planning through accurate text-student matching
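
To make the idea of a readability metric concrete, here is a minimal sketch of the classic Flesch-Kincaid grade level formula, the kind of traditional baseline that a model trained on teacher ratings might be compared against. It is not the method used to build this corpus, and the syllable counter is a crude vowel-group heuristic assumed here for brevity.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


easy = flesch_kincaid_grade("The cat sat on the mat.")
hard = flesch_kincaid_grade(
    "Photosynthesis converts atmospheric carbon dioxide into carbohydrates."
)
print(round(easy, 2), round(hard, 2))
```

Formulas like this rely only on surface features such as sentence and word length, which is one reason human-rated corpora are useful for developing and testing alternatives.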

The Quest Dataset

This dataset contains approximately 8,900 reading comprehension question-answer pairs based on 244 classic narrative fairy tales authored by writers from diverse cultural backgrounds, including stories from England, Japan, and Native American folklore. Each question-answer pair is annotated for one of seven narrative element categories and for whether the answer is explicit or implicit in the source text. Metadata includes the cultural or ethnic origin of each fairy tale (e.g., Scottish, Japanese), supporting the development of culturally responsive educational materials.

The dataset was created using an evidence-based theoretical framework for narrative comprehension and is one of few focusing specifically on reading comprehension questions tied to children’s narrative texts. Questions were annotated by experts in education, psychology, and cognitive science, all with substantial experience in teaching and reading assessment. The dataset can be downloaded here.

Use Cases:

  • Training models to generate culturally sensitive and personally relevant reading comprehension materials
  • Supporting narrative comprehension instruction for K–8 students
  • Creating adaptive learning tools that match content to students’ cultural backgrounds and interests
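
The annotations described above lend themselves to simple filtering, for example pulling only implicit questions from stories of a given origin. The sketch below shows one way a record might be represented; the class, its field names, the category labels, and the sample question-answer pairs are all hypothetical and are not drawn from the dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    narrative_element: str  # placeholder label; the dataset defines seven categories
    is_explicit: bool       # True if the answer is stated directly in the text
    story_origin: str       # cultural or ethnic origin of the source story


# Made-up examples for illustration only.
pairs = [
    QAPair("Who helped the fisherman?", "A turtle.", "character", True, "Japanese"),
    QAPair("Why did the king feel ashamed?", "He had broken his promise.",
           "feeling", False, "Scottish"),
    QAPair("Where did the sisters live?", "In a cottage by the sea.",
           "setting", True, "Scottish"),
]


def select_pairs(pairs, explicit: Optional[bool] = None,
                 origin: Optional[str] = None):
    """Return pairs matching the given filters; a filter left as None is ignored."""
    return [
        p for p in pairs
        if (explicit is None or p.is_explicit == explicit)
        and (origin is None or p.story_origin == origin)
    ]


implicit_scottish = select_pairs(pairs, explicit=False, origin="Scottish")
print([p.question for p in implicit_scottish])
```

Separating explicit from implicit questions in this way is one plausible first step toward matching question difficulty to a reader, under the assumption that implicit questions demand more inference.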

Reliable educational research shouldn’t cease because of limited access to federal data. These eight alternative datasets demonstrate that the education community still has valuable tools at its disposal, tools that can support meaningful research, innovation in learning technologies, and better-informed decision-making. As the landscape shifts, tapping into these resources can help progress continue, regardless of setbacks.

Jules King

Program Manager

Kennedy Smith

Program Associate

