There’s a lot of conversation these days about AI in schools. Some argue that it’s doing too little for students, while others worry that it will ultimately replace teachers. These are important points, but conversations about AI in education often overlook how to make AI reliable, fair, and trustworthy. Those qualities depend on how well the data used to train the AI represents the populations it is meant to serve.
As an analogy, think of AI as a robot learning how to do a job by watching many examples of that job being done. The better the examples the robot sees, the better it can do its job. If it only sees examples from one kind of worker, it will learn only a limited set of strategies for completing its work. This simple analogy takes on real weight given the range of students, topics, grade levels, and tasks a teacher navigates every day, and it does not even account for the diversity of AI’s end users.
Consider, for instance, a recent survey by the Walton Family Foundation that found that Black teachers and educators in urban districts had the highest AI usage rates, at 86 percent. Unfortunately, the datasets used to train current AI are not representatively diverse and may not reflect the needs of many users. For example, a 2023 Stanford University study found that biases in GPT detectors often lead them to misclassify non-native English writing as AI-generated. Even if an AI were trained on educational data, the writing samples, constructed responses, assessment data, and learning logs used to build it would be unlikely to reflect the learning scenarios of the many educators using it, and the AI may therefore be less effective.
However, this need not be the case. By creating, publishing, and promoting strong, diverse, and open educational datasets in research and data science competitions, we can train AI on data that better serves all student populations.
A good example of this type of data is the Feedback Prize, developed by the Learning Agency Lab and Vanderbilt University. The prize yielded a dataset of roughly 25,000 student essays from a diverse set of writers, along with a collection of AI models that can support assistive writing feedback technology. The data is open source and includes demographic and individual-difference codes for students. The Feedback Prize data has been used by thousands of scientists to refine writing assistant tools and will be integrated into the next generation of AI models so they can better respond to support queries from a broad population of student writers and teachers. More datasets like the Feedback Prize are needed to train AI to respond to the needs of teachers and students in the classroom. Importantly, datasets spanning a variety of disciplines, tasks, and age groups are needed to ensure that classroom AI gives stakeholders the feedback they deserve.
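To illustrate how an open dataset like this can feed directly into model development, here is a minimal sketch of fine-tuning a small transformer to classify essay quality using the Hugging Face libraries. The file name and column names are hypothetical placeholders, not the actual Feedback Prize schema.

```python
# Minimal fine-tuning sketch. "feedback_prize_essays.csv" and its
# columns are hypothetical stand-ins for an open essay dataset.
import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

df = pd.read_csv("feedback_prize_essays.csv")        # hypothetical export
labels = sorted(df["effectiveness_label"].unique())  # e.g. low/medium/high
df["label"] = df["effectiveness_label"].map({l: i for i, l in enumerate(labels)})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["essay_text"], truncation=True, max_length=512)

ds = Dataset.from_pandas(df[["essay_text", "label"]])
ds = ds.map(tokenize, batched=True).train_test_split(test_size=0.1)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="feedback-model", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad per batch
)
trainer.train()
```

The point of the sketch is not the specific model but the pipeline: with an openly licensed, well-annotated dataset, this kind of adaptation becomes accessible to any research team.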
In the spirit of this goal, we have put together a series of steps to help promote educational datasets that could be used to better fine-tune AI models and, importantly, to help surface ideas for their quick development. Some of these steps are long-term approaches, while others may see quick turnaround. All are needed to ensure AI is effective and helpful for students and educators regardless of background and topic.
Create Specialized Platforms for Education Dataset Challenges
Kaggle is a popular and respected data science platform that hosts datasets and challenges for the data science community. However, Kaggle is expensive to use, can be slow to release dataset challenges, and does not specialize in education. Creating a Kaggle-like platform for education could bring together talent and key benchmarks. Such a platform could also host benchmark model-performance tools, similar to the SEAL Leaderboards, that rank LLMs’ performance on educational outcomes using private datasets. A Kaggle-like educational site could form communities of scientists specializing in educational theory and analytics while remaining open to all, and could identify and promote top learning scientists. Adopting gamified rewards like badges, medals, and titles would incentivize engagement and attract talent to the site. Importantly, such a site would improve transparency into the strengths and weaknesses of educational datasets, models, and other solutions.
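To make the leaderboard idea concrete, here is a toy sketch of how rankings could be computed from per-task scores on private evaluation sets. The model names, task names, and scores are invented for illustration and do not reflect any real benchmark.

```python
# Toy leaderboard: rank models by mean score across educational tasks.
# All names and numbers below are invented for illustration.
scores = {
    "model-a": {"essay_feedback": 0.81, "math_misconceptions": 0.64},
    "model-b": {"essay_feedback": 0.77, "math_misconceptions": 0.71},
    "model-c": {"essay_feedback": 0.69, "math_misconceptions": 0.58},
}

leaderboard = sorted(
    ((name, sum(tasks.values()) / len(tasks)) for name, tasks in scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for rank, (name, mean_score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {mean_score:.3f}")
```

Even a simple aggregation like this becomes valuable when the underlying tasks are held-out, education-specific datasets that vendors cannot overfit to.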
Invest in Data Science Competitions
Aside from creating a dedicated educational dataset platform, there are other ways to use data science competitions to develop benchmarks and datasets. One straightforward approach is to support data competitions or dataset awards at academic conferences.
Examples include:
- The NeurIPS Conference (Neural Information Processing Systems), which includes a datasets and benchmarks track.
- The BEA Workshop (Building Educational Applications), which often features a shared-task competition.
- The KDD Cup (Knowledge Discovery and Data Mining), which runs a data mining and knowledge discovery competition.
- The CIKM AnalytiCup (Conference on Information and Knowledge Management), which includes several data challenges.
- The EDM Dataset Awards (Educational Data Mining Society).
Fund the Development of Transformative Datasets
A major obstacle to developing strong educational datasets is that federal funding agencies generally do not fund datasets, nor do many private philanthropies. Developing datasets with rigorous annotations and outcome variables is very expensive, averaging between $250,000 and $550,000. However, robust datasets are well worth the high cost of developing them.
The two biggest factors influencing cost are whether the data has already been prepared and whether the dataset requires human annotation. Human annotation is necessary when the dataset contains outcome variables that are not standardized, such as human judgments of quality or annotations of features like misconceptions in math self-explanations. Establishing baseline scores for authentic assessments like these is difficult and time-consuming, involving rubric development, expert training and rating, and bias mitigation.
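As one concrete illustration of the rater-calibration step, here is a minimal sketch that checks agreement between two human raters with a weighted Cohen’s kappa. The rubric scores below are invented, and the 0.7 threshold is a common rule of thumb rather than a fixed standard.

```python
# Rater-calibration check: quadratic-weighted Cohen's kappa between two
# raters scoring the same essays on a 1-4 rubric. Ratings are invented.
from sklearn.metrics import cohen_kappa_score

rater_1 = [3, 2, 4, 1, 3, 2, 4, 3, 1, 2]
rater_2 = [3, 3, 4, 1, 2, 2, 4, 3, 2, 2]

# Quadratic weights penalize large disagreements more than adjacent ones,
# a common choice for ordinal rubric scores.
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")  # e.g., retrain raters if below ~0.7
```

Checks like this run repeatedly across rubric revisions and rater cohorts, which is part of why human-annotated datasets are so expensive to build.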
An externally prepared dataset often comes from an online educational platform that captures and records the data, but not necessarily in the format a competition requires. A data analyst must clean and reformat the data to make it usable, and the same is often true of internally prepared datasets that were not created for data modeling purposes.
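As a rough illustration of that cleaning work, here is the kind of reshaping an analyst might do with a hypothetical event-log export; the file name and column names are invented.

```python
# Sketch of reshaping raw platform event logs into a modeling-ready table.
# "raw_platform_events.csv" and its columns are hypothetical.
import pandas as pd

events = pd.read_csv("raw_platform_events.csv", parse_dates=["timestamp"])

# Drop malformed rows and deduplicate repeated event submissions.
events = events.dropna(subset=["student_id", "item_id", "correct"])
events = events.drop_duplicates(subset=["student_id", "item_id", "timestamp"])

# Aggregate the event stream into one row per student with summary
# features, the kind of flat table a competition typically expects.
table = (events
         .sort_values("timestamp")
         .groupby("student_id")
         .agg(attempts=("item_id", "count"),
              accuracy=("correct", "mean"),
              first_seen=("timestamp", "min")))
table.to_csv("modeling_table.csv")
```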
We have been tracking existing datasets that have not yet been released because they are not annotated for specific use cases, contain private data, are proprietary, or are incomplete. The list is far from complete, but it illustrates what data is available. With time, money, or both, many of these datasets could be developed to train algorithms underlying fundamental tools in math, writing, and reading, and even to advance our understanding of learning and language acquisition. An additional list of existing publicly available datasets was published here, on the Learning Engineering Hub. While new data is always welcome, previously published and utilized datasets such as these can be combined with new data, repurposed with fresh annotation, or re-examined with new analysis to provide new insights.
While federal funding is rarely dedicated to dataset development, some private funding opportunities and partnerships to build comprehensive datasets do exist. One example is the Tools Competition, which awards millions of dollars annually for edtech innovations. Examples of winners that generate benchmark datasets and encourage dataset generation include the University of Minnesota’s Mind Wandering work and Teaching Lab’s handwritten math benchmark for LLMs.
Build Robust Mechanisms for Data Sharing
Unlike in many other fields, sharing educational data within a single, well-used space is currently difficult. The National Science Foundation (NSF) has supported DataShop, a data repository for learning science researchers. Other fields have analogous infrastructure; biomedical research, for example, has Galaxy, an open-source, web-based platform for data-intensive research.
A well-funded one-stop education shop with strong uptake is necessary to help share educational datasets and surface new ones. This could be more than just a data repository by including tools to increase the likelihood of sharing and adoption. These tools may include:
- Freely available annotation systems that make it easier for researchers, administrators, and teachers to mark up new datasets. See Prodigy as an example.
- A tool for automatic removal of personally identifiable information (PII). PII requirements create bottlenecks that prevent uptake, and better algorithmic systems for uncovering PII are a potentially underleveraged approach. See the TLA-run Kaggle competition as an example, and the sketch after this list.
- Standardized data formats. Many older datasets are stored in formats poorly suited to sharing, particularly for voice and audio content. Creating more standardized ML formats, like the work being done by 1EdTech on interoperability standards, could help.
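To illustrate the PII-removal idea from the list above, here is a rough sketch combining regular expressions for emails and phone numbers with off-the-shelf named-entity recognition for person names. This is a simplified, assumption-laden sketch, not the approach used in the TLA-run competition; a production system would require far more rigorous evaluation.

```python
# Rough sketch of automated PII redaction: regex for emails and phone
# numbers plus spaCy NER for person names. Illustrative only; a real
# system needs careful evaluation of misses and false positives.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English model

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    doc = nlp(text)
    # Replace person names found by the NER model, right to left so
    # character offsets stay valid as the string is edited.
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            text = text[:ent.start_char] + "[NAME]" + text[ent.end_char:]
    return text

print(redact("Contact Jane Doe at jane@example.com or 555-123-4567."))
```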
Strengthen the Learning Engineering Field
The education field tends to value data science and engineering less than other fields do. To help surface datasets, it would be beneficial to socialize the education field to value data-driven approaches to interventions and theory. Helping develop data literacy in the next generation of teachers, researchers, and academics would promote dataset development and sharing.
A few universities are moving in this direction, but many of these programs are at prestigious institutions that are not accessible to all teachers, learners, and future researchers. Examples of existing programs include learning analytics certificates from Vanderbilt University and North Carolina State University, learning analytics master’s degrees from the University of Pennsylvania, learning analytics online courses from MichiganX, and learning analytics short courses from the Institute of Education Sciences.
Additional Ideas
Beyond the core ideas above, other ways to spur educational datasets include:
- Providing researchers with methods, resources, and locations to distribute existing datasets to help advance knowledge sharing. For instance, creating an open-source academic journal for educational datasets would create a space for publication and increase citations. Nature’s Scientific Data, IEEE Data Descriptions, and Data in Brief are examples of dataset journals for general audiences.
- Hosting in-person events to surface more educational datasets, for instance by convening researchers, industry partners, and stakeholders to identify existing datasets and develop new ones.
- Running hackathons to introduce the mechanics of principled dataset construction to new audiences and socialize the idea of educational datasets.
- Creating teacher advisory boards to guide the development of datasets based on classroom needs.
Conclusion
Educational data is an essential tool for improving learning outcomes. However, educational data mining cannot realize its full potential in a dataset vacuum. There are numerous opportunities to source, develop, and build on educational datasets in order to revolutionize how data is used in education. As the field continues to grow, it is essential to build a culture among educators, researchers, policymakers, and funders that values data-driven approaches, and that values open data even more. With the right investments and collaboration, we can ensure that educational data becomes a powerful tool for innovation and policy-making, ultimately improving learning outcomes for all students.