Datasets

Datasets are transformative. In education, they help drive the exploration of educational trends, uncover key insights, and identify improvements in both learning outcomes and instructional practices. They also serve as the foundation for artificial intelligence (AI), machine learning (ML), and data science, which are revolutionizing how education is delivered and experienced by students. By structuring complex information and streamlining analysis, datasets create the basis for faster, more impactful innovation.

The Learning Agency strives to develop and disseminate datasets to support a wide range of professionals in education and technology. Our dataset work is grounded in three fundamental principles:

  • Inspiration – Enable users to create new ML models, adapt existing ones, or develop custom datasets that support their unique goals.
  • Education – Bridge the gap between machine learning, AI, and real-world educational applications.
  • Access – Make it easier to collaborate, share data, and build on the work of others across the broader learning engineering ecosystem.

Note that many of the datasets below were produced by The Learning Agency Lab. That organization is in the process of sunsetting, so The Learning Agency will maintain these datasets to ensure they remain publicly accessible.

Available Datasets

The datasets below reflect these principles in action, designed to power real-world solutions across writing assessment, reading comprehension, privacy protection, and game-based learning. Each dataset is open-source, well-documented, and ready to support a variety of research, product development, and applied AI use cases. 

All datasets have been created through data science competitions or research collaborations and are available to download in accessible formats like CSV or XLSX. They are each licensed under CC BY 4.0, allowing for reuse, adaptation, and redistribution with proper attribution.
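
Because every dataset ships as flat files, getting started usually takes only a few lines of Python. The sketch below uses pandas and a placeholder filename; substitute the file you actually download and adjust for XLSX where needed.

    # Minimal sketch for loading any of the datasets listed below.
    # "dataset.csv" is a placeholder filename, not a real file in the collection.
    import pandas as pd

    df = pd.read_csv("dataset.csv")       # for XLSX files, use pd.read_excel(...)
    print(df.shape)                       # number of rows and columns
    print(df.columns.tolist())            # available fields
    print(df.head())                      # first few records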

The current collection includes eight datasets covering key areas in education:

    • Quest – Reading comprehension question pairs used to train generative models for educational content
    • Jo Wilder – Game interaction logs used to analyze student engagement and predict learning outcomes
    • PERSUADE – Annotated student essays highlighting argumentative structure, rhetorical elements, and quality ratings
    • KLICKE – Keystroke logs capturing real-time writing process data, including pauses, deletions, and bursts
    • AIDE – A collection of student and AI-generated essays used to train models that detect LLM-written content
    • PIILO – Student writing annotated for personally identifiable information to support privacy-preserving model development
    • ASAP 2.0 – Student argumentative essays with holistic scores to support equitable, automated writing assessment and reduce bias in model development
    • College Readiness Math Questions – SAT-style math questions with expert annotations to support automated skill assessment and question generation

Users can browse the collection to find data that fits their needs, whether it’s training AI models, building educational tools, conducting academic research, or supporting product development. As the collection grows, additional datasets will be added to support new research questions and product applications. We invite you to explore the datasets and use them to power your next project in educational AI and data science.

The Quest Dataset

Created through the “Quest for Quality Questions” competition, this dataset supports the development of AI models that generate reading comprehension questions for young learners. High-quality questions are essential for evaluating narrative understanding, but they’re time-consuming to create. This dataset, built on the FairytaleQA dataset of 10,580 questions, helps automate that process by focusing on comprehension skills often overlooked in similar resources.

The dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

Jo Wilder Dataset

This dataset originates from the “Predict Student Performance from Game Play” competition. The competition focused on predicting student performance during game-based learning in the educational point-and-click game Jo Wilder and the Capitol Case, which is aligned with 3rd-5th grade Wisconsin social studies standards. It includes one of the largest open collections of game logs and is designed to support research and model development in game-based learning and assessment.
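
As a rough illustration of how game logs like these can be turned into model inputs, the sketch below aggregates events per play session. The filename and column names are assumptions for illustration, not the documented schema.

    # Illustrative sketch: turning raw game-log events into per-session features.
    # The filename and columns ("session_id", "event_name", "elapsed_time") are
    # assumptions; check the Jo Wilder documentation for the actual schema.
    import pandas as pd

    logs = pd.read_csv("jo_wilder_logs.csv")

    session_features = (
        logs.groupby("session_id")
            .agg(n_events=("event_name", "size"),      # number of interactions
                 total_time=("elapsed_time", "max"))   # time spent in the game
            .reset_index()
    )
    print(session_features.head())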

The Jo Wilder dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

PERSUADE Dataset

The Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus originated from the Feedback Prize project, an initiative by Georgia State University and The Learning Agency Lab, now merged into The Learning Agency. It includes labeled data for over 14,000 nationally representative student essays from grades 6-12, focusing on argumentative and rhetorical elements within each essay. The dataset contains effectiveness ratings of discourse elements, holistic essay quality scores, and student demographic information including grade level, race/ethnicity, and economic background. It supports research and model development in automated writing feedback and analysis.
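
As one example of how the annotations might be explored, the sketch below tabulates the share of effectiveness ratings for each discourse element type; the filename and column names are assumptions rather than the documented schema.

    # Illustrative sketch: share of effectiveness ratings per discourse element.
    # The filename and columns ("discourse_type", "discourse_effectiveness") are
    # assumptions; consult the PERSUADE documentation for the real field names.
    import pandas as pd

    persuade = pd.read_csv("persuade_corpus.csv")

    summary = (
        persuade.groupby("discourse_type")["discourse_effectiveness"]
                .value_counts(normalize=True)   # proportion of each rating
                .rename("share")
                .reset_index()
    )
    print(summary)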

The PERSUADE dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

KLICKE Dataset

The Keystroke Logs in Compositions for Knowledge Evaluation (KLICKE) Corpus was released through the “Linking Writing Processes to Writing Quality” competition held from late 2023 to early 2024. It contains detailed keystroke logs capturing writing process features such as pauses, deletions, bursts, and process variance. The dataset supports research and model development aimed at understanding and predicting writing quality through the writing process. It also helps train artificial intelligence models for automated writing evaluation, intelligent tutoring systems, and writing support tools.
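
To give a sense of how process features such as pauses can be derived from the logs, the sketch below computes simple pause statistics per essay; the filename and column names are assumptions, not the documented schema.

    # Illustrative sketch: deriving simple pause features from keystroke logs.
    # The filename and columns ("id", "down_time", "up_time", in milliseconds)
    # are assumptions; check the KLICKE documentation for the real schema.
    import pandas as pd

    keys = pd.read_csv("klicke_keystrokes.csv").sort_values(["id", "down_time"])

    # A pause is the gap between releasing one key and pressing the next.
    keys["pause_ms"] = keys["down_time"] - keys.groupby("id")["up_time"].shift()

    pause_features = (
        keys.groupby("id")["pause_ms"]
            .agg(mean_pause="mean",
                 long_pauses=lambda s: (s > 2000).sum())   # pauses over 2 seconds
            .reset_index()
    )
    print(pause_features.head())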

The KLICKE dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

AIDE Dataset

The AI Detection for Essays (AIDE) dataset originates from the “LLM – Detect AI Generated Text” competition, held in early 2024. This competition challenged participants to develop machine learning models to distinguish between student-written essays and those generated by large language models (LLMs). The dataset includes 10,000 essays written in response to seven prompts, composed by students and various LLMs. It supports research and development of AI tools aimed at detecting AI-generated text to help educators mitigate plagiarism and preserve learning integrity.
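
A minimal baseline for this kind of detection task, assuming hypothetical column names, might pair TF-IDF features with logistic regression, as sketched below.

    # Illustrative sketch: a simple bag-of-words baseline for flagging LLM-written
    # essays. The filename and columns ("text", "generated") are assumptions;
    # check the AIDE documentation for the actual fields.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    essays = pd.read_csv("aide_essays.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        essays["text"], essays["generated"], test_size=0.2, random_state=0
    )

    vectorizer = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(X_train), y_train)

    probs = clf.predict_proba(vectorizer.transform(X_test))[:, 1]
    print("ROC AUC:", roc_auc_score(y_test, probs))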

The AIDE dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

PIILO Dataset

The Personally Identifiable Information Location (PIILO) dataset comes from the “PII Data Detection” competition, held in early 2024, which focused on detecting and annotating personally identifiable information (PII) in student essays. It contains approximately 22,000 essays written by students in response to a single assignment prompt from an open online course. The dataset supports research and model development aimed at automating PII detection to protect student privacy and enable safer sharing of educational data.
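
As a deliberately naive illustration of the task (production systems typically use token-classification models), the sketch below flags two obvious PII patterns with regular expressions; the filename and column name are assumptions.

    # Illustrative sketch: a naive rule-based pass for two obvious PII types.
    # The filename and "full_text" column are assumptions; check the PIILO
    # documentation for the actual schema and annotation format.
    import re
    import pandas as pd

    essays = pd.read_csv("piilo_essays.csv")

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

    def scan(text: str) -> dict:
        """Count naive email and phone-number matches in one essay."""
        return {"emails": len(EMAIL.findall(text)), "phones": len(PHONE.findall(text))}

    flags = pd.DataFrame(essays["full_text"].apply(scan).tolist())
    print(flags.sum())   # total matches flagged across the corpus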

The PIILO dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

ASAP 2.0 Dataset

This dataset originates from the ASAP 2.0 (Automated Student Assessment Prize) competition, which concluded in July 2024. It includes approximately 24,000 student-written argumentative essays aligned to current standards for student-appropriate assessments. Designed to support research in automated essay scoring, the dataset enables model development for holistic scoring, timely feedback, and improved writing instruction, particularly in underserved communities. It also includes diverse samples across economic and geographic groups to help reduce algorithmic bias.
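
Automated essay scoring work on data like this is commonly evaluated with quadratic weighted kappa, which measures agreement between model scores and human holistic scores; the sketch below uses placeholder values rather than scores from the dataset.

    # Illustrative sketch: quadratic weighted kappa, the agreement metric commonly
    # used in automated essay scoring. The scores below are placeholders only.
    from sklearn.metrics import cohen_kappa_score

    human_scores = [3, 4, 2, 5, 1, 4]   # placeholder holistic scores
    model_scores = [3, 3, 2, 5, 2, 4]   # placeholder model predictions

    qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
    print(f"Quadratic weighted kappa: {qwk:.3f}")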

The ASAP 2.0 dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

College Readiness Math Questions Dataset

The College Readiness Math Questions Dataset is a curated collection of 434 SAT-style mathematics questions. Each question is annotated by expert raters across 19 skill categories and over 100 specific learning objectives. Designed to support research in educational assessment and natural language processing, the dataset includes structured feedback on question quality and answer choices, along with domain-aligned labels in Algebra, Advanced Math, Data Analysis, and Geometry. The problems, labels, and tasks were generated by large language models, and comments were provided by math teachers.

The College Readiness Math Questions Dataset is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/
