Datasets

Datasets are transformative. In education, they help drive the exploration of educational trends, uncover key insights, and identify improvements in both learning outcomes and instructional practices. They also serve as the foundation for artificial intelligence (AI), machine learning (ML), and data science, which are revolutionizing how education is delivered and experienced by students. By structuring complex information and streamlining analysis, datasets create the basis for faster, more impactful innovation.

The Learning Agency strives to develop and disseminate datasets to support a wide range of professionals in education and technology. Our dataset work is grounded in three fundamental principles:

  • Inspiration – Enable users to create new ML models, adapt existing ones, or develop custom datasets that support their unique goals.
  • Education – Bridge the gap between machine learning, AI, and real-world educational applications.
  • Access – Make it easier to collaborate, share data, and build on the work of others across the broader learning engineering ecosystem.

Note that many of the datasets below were produced by The Learning Agency Lab. That organization is in the process of sunsetting, so The Learning Agency will maintain these datasets to ensure they remain publicly accessible.

Available Datasets

The datasets below reflect these principles in action, designed to power real-world solutions across writing assessment, reading comprehension, privacy protection, and game-based learning. Each dataset is open-source, well-documented, and ready to support a variety of research, product development, and applied AI use cases. 

All datasets have been created through data science competitions or research collaborations and are available to download in accessible formats like CSV or XLSX. They are each licensed under CC BY 4.0, allowing for reuse, adaptation, and redistribution with proper attribution.
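
Because every dataset ships as flat files, getting started usually takes only a few lines of Python. The sketch below uses pandas and a placeholder filename; substitute the file you actually download and adjust for XLSX where needed.

    # Minimal sketch for loading any of the datasets listed below.
    # "dataset.csv" is a placeholder filename, not a real file in the collection.
    import pandas as pd

    df = pd.read_csv("dataset.csv")       # for XLSX files, use pd.read_excel(...)
    print(df.shape)                       # number of rows and columns
    print(df.columns.tolist())            # available fields
    print(df.head())                      # first few records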

The current collection includes eight datasets covering key areas in education:

    • Quest – Reading comprehension question pairs used to train generative models for educational content
    • Jo Wilder – Game interaction logs used to analyze student engagement and predict learning outcomes
    • PERSUADE – Annotated student essays highlighting argumentative structure, rhetorical elements, and quality ratings
    • KLICKE – Keystroke logs capturing real-time writing process data, including pauses, deletions, and bursts
    • AIDE – A collection of student and AI-generated essays used to train models that detect LLM-written content
    • PIILO – Student writing annotated for personally identifiable information to support privacy-preserving model development
    • ASAP 2.0 – Student argumentative essays with holistic scores to support equitable, automated writing assessment and reduce bias in model development
    • College Readiness Math Questions – SAT-style math questions with expert annotations to support automated skill assessment and question generation

Users can browse the collection to find data that fits their needs, whether it’s training AI models, building educational tools, conducting academic research, or supporting product development. As the collection grows, additional datasets will be added to support new research questions and product applications. We invite you to explore the datasets and use them to power your next project in educational AI and data science.

The Quest Dataset

Created through the “Quest for Quality Questions” competition, this dataset supports the development of AI models that generate reading comprehension questions for young learners. High-quality questions are essential for evaluating narrative understanding, but they’re time-consuming to create. This dataset, built on the FairytaleQA dataset of 10,580 questions, helps automate that process by focusing on comprehension skills often overlooked in similar resources.

The dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

Jo Wilder Dataset

This dataset originates from the “Predict Student Performance from Game Play” competition. The competition focused on predicting student performance during game-based learning in the educational point-and-click game Jo Wilder and the Capitol Case, which is aligned with 3rd-5th grade Wisconsin social studies standards. It includes one of the largest open collections of game logs and is designed to support research and model development in game-based learning and assessment.
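
As a rough illustration of how game logs like these can be turned into model inputs, the sketch below aggregates events per play session. The filename and column names are assumptions for illustration, not the documented schema.

    # Illustrative sketch: turning raw game-log events into per-session features.
    # The filename and columns ("session_id", "event_name", "elapsed_time") are
    # assumptions; check the Jo Wilder documentation for the actual schema.
    import pandas as pd

    logs = pd.read_csv("jo_wilder_logs.csv")

    session_features = (
        logs.groupby("session_id")
            .agg(n_events=("event_name", "size"),      # number of interactions
                 total_time=("elapsed_time", "max"))   # time spent in the game
            .reset_index()
    )
    print(session_features.head())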

The Jo Wilder dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

PERSUADE Dataset

The Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus originated from the Feedback Prize project, an initiative by Georgia State University and The Learning Agency Lab, now merged into The Learning Agency. It includes labeled data for over 14,000 nationally representative student essays from grades 6-12, focusing on argumentative and rhetorical elements within each essay. The dataset contains effectiveness ratings of discourse elements, holistic essay quality scores, and student demographic information including grade level, race/ethnicity, and economic background. It supports research and model development in automated writing feedback and analysis.
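
As one example of how the annotations might be explored, the sketch below tabulates the share of effectiveness ratings for each discourse element type; the filename and column names are assumptions rather than the documented schema.

    # Illustrative sketch: share of effectiveness ratings per discourse element.
    # The filename and columns ("discourse_type", "discourse_effectiveness") are
    # assumptions; consult the PERSUADE documentation for the real field names.
    import pandas as pd

    persuade = pd.read_csv("persuade_corpus.csv")

    summary = (
        persuade.groupby("discourse_type")["discourse_effectiveness"]
                .value_counts(normalize=True)   # proportion of each rating
                .rename("share")
                .reset_index()
    )
    print(summary)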

The PERSUADE dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

KLICKE Dataset

The Keystroke Logs in Compositions for Knowledge Evaluation (KLICKE) Corpus was released through the “Linking Writing Processes to Writing Quality” competition held from late 2023 to early 2024. It contains detailed keystroke logs capturing writing process features such as pauses, deletions, bursts, and process variance. The dataset supports research and model development aimed at understanding and predicting writing quality through the writing process. It also helps train artificial intelligence models for automated writing evaluation, intelligent tutoring systems, and writing support tools.
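
To give a sense of how process features such as pauses can be derived from the logs, the sketch below computes simple pause statistics per essay; the filename and column names are assumptions, not the documented schema.

    # Illustrative sketch: deriving simple pause features from keystroke logs.
    # The filename and columns ("id", "down_time", "up_time", in milliseconds)
    # are assumptions; check the KLICKE documentation for the real schema.
    import pandas as pd

    keys = pd.read_csv("klicke_keystrokes.csv").sort_values(["id", "down_time"])

    # A pause is the gap between releasing one key and pressing the next.
    keys["pause_ms"] = keys["down_time"] - keys.groupby("id")["up_time"].shift()

    pause_features = (
        keys.groupby("id")["pause_ms"]
            .agg(mean_pause="mean",
                 long_pauses=lambda s: (s > 2000).sum())   # pauses over 2 seconds
            .reset_index()
    )
    print(pause_features.head())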

The KLICKE dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

AIDE Dataset

The AI Detection for Essays (AIDE) dataset originates from the “LLM – Detect AI Generated Text” competition, held in early 2024. This competition challenged participants to develop machine learning models to distinguish between student-written essays and those generated by large language models (LLMs). The dataset includes 10,000 essays written in response to seven prompts, composed by students and various LLMs. It supports research and development of AI tools aimed at detecting AI-generated text to help educators mitigate plagiarism and preserve learning integrity.
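
A minimal baseline for this kind of detection task, assuming hypothetical column names, might pair TF-IDF features with logistic regression, as sketched below.

    # Illustrative sketch: a simple bag-of-words baseline for flagging LLM-written
    # essays. The filename and columns ("text", "generated") are assumptions;
    # check the AIDE documentation for the actual fields.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    essays = pd.read_csv("aide_essays.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        essays["text"], essays["generated"], test_size=0.2, random_state=0
    )

    vectorizer = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)

    probs = clf.predict_proba(vectorizer.transform(X_test))[:, 1]
    print("ROC AUC:", roc_auc_score(y_test, probs))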

The AIDE dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

PIILO Dataset

The Personally Identifiable Information Location (PIILO) dataset comes from the “PII Data Detection” competition, held in early 2024, which focused on detecting and annotating personally identifiable information (PII) in student essays. It contains approximately 22,000 essays written by students in response to a single assignment prompt from an open online course. The dataset supports research and model development aimed at automating PII detection to protect student privacy and enable safer sharing of educational data.
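
As a deliberately naive illustration of the task (production systems typically use token-classification models), the sketch below flags two obvious PII patterns with regular expressions; the filename and column name are assumptions.

    # Illustrative sketch: a naive rule-based pass for two obvious PII types.
    # The filename and "full_text" column are assumptions; check the PIILO
    # documentation for the actual schema and annotation format.
    import re
    import pandas as pd

    essays = pd.read_csv("piilo_essays.csv")

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

    def scan(text: str) -> dict:
        """Count naive email and phone-number matches in one essay."""
        return {"emails": len(EMAIL.findall(text)), "phones": len(PHONE.findall(text))}

    flags = pd.DataFrame(essays["full_text"].apply(scan).tolist())
    print(flags.sum())   # total matches flagged across the corpus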

The PIILO dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

ASAP 2.0 Dataset

This dataset originates from the ASAP 2.0 (Automated Student Assessment Prize) competition, which concluded in July 2024. It includes approximately 24,000 student-written argumentative essays aligned to current standards for student-appropriate assessments. Designed to support research in automated essay scoring, the dataset enables model development for holistic scoring, timely feedback, and improved writing instruction, particularly in underserved communities. It also includes diverse samples across economic and geographic groups to help reduce algorithmic bias.
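
Automated essay scoring work on data like this is commonly evaluated with quadratic weighted kappa, which measures agreement between model scores and human holistic scores; the sketch below uses placeholder values rather than scores from the dataset.

    # Illustrative sketch: quadratic weighted kappa, the agreement metric commonly
    # used in automated essay scoring. The scores below are placeholders only.
    from sklearn.metrics import cohen_kappa_score

    human_scores = [3, 4, 2, 5, 1, 4]   # placeholder holistic scores
    model_scores = [3, 3, 2, 5, 2, 4]   # placeholder model predictions

    qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
    print(f"Quadratic weighted kappa: {qwk:.3f}")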

The ASAP 2.0 dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

College Readiness Math Questions Dataset

The College Readiness Math Questions Dataset is a curated collection of 434 SAT-style mathematics questions. Each question is annotated by expert raters across 19 skill categories and over 100 specific learning objectives. Designed to support research in educational assessment and natural language processing, the dataset includes structured feedback on question quality and answer choices, along with domain-aligned labels in Algebra, Advanced Math, Data Analysis, and Geometry. The problems, labels, and tasks were generated by large language models, and comments were provided by math teachers.

The College Readiness Math Questions Dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
