Executive Summary
The MAP – Charting Student Math Misunderstandings competition was created to advance methods of identifying and classifying student math errors through machine learning. Hosted on Kaggle, the competition brought together over 1,850 teams and garnered nearly 40,000 model submissions, aiming to improve how misconceptions and other common errors in mathematics are detected. Top-performing models achieved Mean Average Precision (MAP@3) scores above 0.94, highlighting the growing potential to support teachers with actionable insights and enhance student outcomes through timely feedback.
Students frequently make systematic errors in mathematics that stem from either procedural mistakes or deeper conceptual misunderstandings (i.e., misconceptions). Without effective identification and feedback, these systematic errors can persist and interfere with future learning. The competition addressed this challenge by encouraging the development of algorithms capable of diagnosing such errors in the responses of students aged 9 to 14 (or grades 4 to 8). By distinguishing between types of misunderstandings, the competition built on established educational research to enable more precise and personalized instructional responses.
By producing a rich, open dataset of student responses annotated with misconception categories derived from experts at Vanderbilt University, the competition created a foundation for benchmarking models and advancing research in educational diagnostics. Top solutions, which leveraged ensembles of large language models (LLMs) and innovative training strategies, demonstrated that high accuracy and interpretability in detecting student reasoning patterns are achievable. The public availability of this dataset also enables ongoing experimentation, collaboration, and the development of new tools aligned with modern math education standards.
These results highlight the promise of adaptive and automated feedback systems capable of providing immediate support to students while informing teacher practice. Future efforts should expand datasets to additional grade levels and a broader range of misconception types, testing the generalizability of current models. Together, these steps will help translate the progress made through the competition into solutions that improve mathematics teaching and learning outcomes.
Background On Math Errors
Research Basis
Different types of math mistakes require different interventions:
- Procedural errors occur when students struggle with the steps needed to solve a problem, such as failing to find a common denominator when adding fractions.
- Slips are minor mistakes made despite knowing the correct procedure.
- Conceptual errors stem from misunderstandings of underlying mathematical principles, such as misinterpreting the significance of a fraction’s denominator.
- Some conceptual errors are weakly held beliefs, where a lack of commitment to an idea allows irrelevant factors to influence responses.
- Others are misconceptions: systematic, persistent misunderstandings that arise when students integrate new information into existing mental frameworks. Misconceptions often stem from preconceptions formed through past experiences, which can also shape how students approach new concepts and sometimes lead to predictable errors.
Understanding the nature of math errors is critical to tailoring instruction and feedback. Research shows that some errors reflect unfinished learning, requiring additional practice, while others indicate deeper gaps that need targeted intervention. A notable example of a math misconception is whole-number bias, where students misapply whole-number concepts to decimals and fractions (e.g., judging decimals by their number of digits rather than by place value). The Math Errors, Interventions, and Opportunities for AI (MEIOAI) Workshop highlighted that math errors result from complex interactions between prior knowledge, instructional cues, and the learning environment.
Feedback and error diagnosis are essential components of effective mathematics instruction because they allow educators to identify the underlying causes of student errors and intervene in ways that support deeper understanding. Students make a range of errors, from slips in calculation to conceptual misunderstandings and misconceptions. Without timely feedback, students may persist in incorrect thinking, reinforcing misconceptions and procedural mistakes that hinder future learning. Providing meaningful feedback helps students recognize and correct errors, strengthens conceptual understanding, and fosters the development of robust mathematical reasoning and problem-solving skills. By addressing errors at their root, educators can prevent the formation of persistent misunderstandings and cultivate learners who approach math with confidence and critical thinking abilities.
Competition Description
The MAP – Charting Student Math Misunderstandings competition was designed to advance methods for identifying and categorizing student math errors in grades 4 to 8 using machine learning and data analysis. The goal was to solicit innovative modeling approaches that can detect patterns of misconceptions and errors in student responses, ultimately supporting improved feedback, diagnostics, and personalized instruction in mathematics education.
The competition was hosted on Kaggle, a platform for data science and machine learning challenges. It was open to data scientists, researchers, and educators interested in applying computational methods to real-world educational data. Participants ranged from academic teams to professionals who use machine learning (software engineers, data scientists, AI engineers, and others), bringing diverse expertise in predictive modeling, natural language processing, and educational technology. The competition drew over 1,850 teams and more than 39,760 submissions.
The dataset consisted of thousands of student responses to math problems, including open-ended written explanations and the correctness of the associated multiple-choice selections. Key features included student IDs, problem IDs and text, the content of the response, and metadata such as misconception tags and correctness labels. The dataset aimed to capture both procedural and conceptual aspects of student thinking, providing a rich foundation for identifying errors, misconceptions, and reasoning patterns.
Participants were tasked with developing models that could predict the type of misconception or error present in each student’s open-ended response. This included classifying responses into predefined error categories to provide insights into student thinking. Models were evaluated on their accuracy in detecting specific errors.
Methodology
Data and Annotations
The project initially sought to leverage data from the National Assessment of Educational Progress (NAEP). The team developed a coding scheme and annotated over 60,000 student responses. However, because of staffing changes at the Institute of Education Sciences (IES), the research team has been unable to publish its research or findings, as personnel were not available to approve disclosure requests. As a result, the competition dataset used data from Eedi aligned with NAEP math problems. The Eedi dataset included students’ multiple-choice responses, written explanations, and associated misconception codes. It was selected for its large sample size and its strong continuity with NAEP math topics and misconception coding of student thinking. The labeling structure supports detailed analysis of student errors, reasoning patterns, and points of misunderstanding.
The dataset consisted of student responses to Diagnostic Questions (DQ) from Eedi, which are multiple-choice questions comprising one correct answer and three distractors. Students also provided optional written explanations for their selected option.
Based on students’ selected answers and written explanations, responses were labeled as correct explanations or as specific errors according to a predefined taxonomy. These categories were derived from prior research in mathematics cognition and aimed to capture common patterns of student thinking. Annotation was performed by 15 trained annotators, all with math tutoring experience, who reviewed student responses and assigned labels based on observable explanations, providing a consistent framework for model training and evaluation. Preprocessing steps included selecting high-quality textual responses, handling missing or inconsistent metadata, and encoding categorical features for machine learning models.
Competition Tasks and Objectives
Participants were tasked with predicting the type of understanding present in each student’s response. Each response in the dataset had been annotated through a three-step process: first, whether the explanation was correct or incorrect; second, whether a target error was present or the response was incorrect for other reasons; and third, if an error was present, which specific type of error it was. The identification and categorization of explanations were based on a taxonomy developed by content experts at Vanderbilt. Thus, the annotation schema included three components: classification of correctness, determination of the presence of common errors, and identification of the specific error type.
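A minimal sketch of this three-step labeling flow, using placeholder label names rather than the competition’s actual category strings, might look like the following:

```python
# Hypothetical sketch of the three-step annotation flow described above.
# Label names are placeholders, not the competition's actual category strings.

def compose_label(is_correct: bool, has_target_error: bool, error_type: str) -> str:
    # Step 1: was the explanation correct or incorrect?
    correctness = "Correct" if is_correct else "Incorrect"
    # Step 2: does the response show a target error, or is it off for other reasons?
    if not has_target_error:
        return f"{correctness}:NoTargetError"
    # Step 3: if a target error is present, which specific type?
    return f"{correctness}:{error_type}"

print(compose_label(False, True, "WholeNumberBias"))  # -> "Incorrect:WholeNumberBias"
```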
The task thus required classifying correct explanations and error types or misconceptions for each problem, with labels assigned based on the reasoning evident in students’ open-ended answers. The objective was to develop models capable of identifying correct reasoning and different types of incorrect reasoning to support automated feedback and educational interventions. Human annotation performance was evaluated using classification metrics, primarily kappa scores, to measure agreement with expert annotations.
Model performance was evaluated using Mean Average Precision @ 3 (MAP@3). Models were given three ranked chances to predict the correct annotation for a given student response and received diminishing credit the later the correct label appeared. For example, if the correct label is B and the model predicts [B, A, C], it scores 1.0; if it predicts [A, B, C], it scores 0.5; and if it predicts [A, C, B], it scores 0.33. The high MAP@3 scores achieved by top teams mean that their models predicted the correct label on the first try for the vast majority of student responses.
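To make the scoring rule concrete, the following minimal Python sketch (written here for illustration, not taken from the competition’s evaluation code) reproduces the arithmetic in the examples above:

```python
# Illustrative sketch of MAP@3 scoring, consistent with the worked examples above.
# Variable names and data are hypothetical, not from the competition code.

def map_at_3(true_labels, predictions):
    """Mean Average Precision @ 3: credit 1, 1/2, or 1/3 depending on where
    the correct label appears in each ranked list of up to three guesses."""
    total = 0.0
    for truth, ranked in zip(true_labels, predictions):
        for rank, guess in enumerate(ranked[:3], start=1):
            if guess == truth:
                total += 1.0 / rank
                break  # only the first correct occurrence counts
    return total / len(true_labels)

# Matches the examples in the text: [B, A, C] -> 1.0, [A, B, C] -> 0.5, [A, C, B] -> ~0.33
print(map_at_3(["B", "B", "B"], [["B", "A", "C"], ["A", "B", "C"], ["A", "C", "B"]]))
# ≈ 0.611 (the mean of 1.0, 0.5, and 0.333)
```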
Approaches by Teams
Top-performing teams primarily leveraged ensembles of LLMs. Innovative approaches included:
- Shared-prefix attention with custom FlexAttention masks: The first-place team framed the task as a ‘suffix classification’ problem in which all possible label candidates were appended, as suffixes, to a shared question-and-response prefix. Custom FlexAttention masks were applied so that each candidate suffix attended only to the shared prefix rather than to other candidates, and the last-token features of each suffix were used to produce logits for cross-entropy classification. This differed from the causal language modeling approach used by the majority of top-ranking teams (a minimal sketch of this masking pattern appears after this list).
- Multi-loss training with soft labels: The second-place team generated additional training labels by averaging predictions from multiple trained models (‘soft labels’), then trained with a loss that combined the original hard labels and these soft labels, potentially accounting for label ambiguity.
- Auxiliary tasks: The third-place team trained models to simultaneously predict related sub-problems, such as whether the answer was correct or what type of reasoning error occurred, providing additional learning signals that improved the main classification task.
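The sketch below illustrates the shared-prefix masking pattern described in the first bullet, assuming PyTorch 2.5 or later with the torch.nn.attention.flex_attention API; the sequence lengths and candidate counts are illustrative placeholders, not values from the winning solution.

```python
# Minimal sketch of a shared-prefix / suffix-classification attention mask.
# Assumes PyTorch >= 2.5 with torch.nn.attention.flex_attention available.
# All lengths below are illustrative placeholders, not the winning team's values.
from torch.nn.attention.flex_attention import create_block_mask

PREFIX_LEN = 256      # tokens of the shared question + student-response prefix
SUFFIX_LEN = 8        # tokens per appended label-candidate suffix
NUM_CANDIDATES = 6    # number of label candidates appended after the prefix
SEQ_LEN = PREFIX_LEN + NUM_CANDIDATES * SUFFIX_LEN

def suffix_mask(b, h, q_idx, kv_idx):
    causal = kv_idx <= q_idx
    in_prefix_q = q_idx < PREFIX_LEN
    in_prefix_kv = kv_idx < PREFIX_LEN
    # Prefix tokens attend causally among themselves.
    prefix_attn = in_prefix_q & in_prefix_kv & causal
    # Suffix tokens may attend to the whole shared prefix...
    suffix_to_prefix = ~in_prefix_q & in_prefix_kv
    # ...and causally to tokens of their own candidate only (never to other candidates).
    same_candidate = ((q_idx - PREFIX_LEN) // SUFFIX_LEN) == ((kv_idx - PREFIX_LEN) // SUFFIX_LEN)
    suffix_to_self = ~in_prefix_q & ~in_prefix_kv & same_candidate & causal
    return prefix_attn | suffix_to_prefix | suffix_to_self

# Build a block mask usable as flex_attention(q, k, v, block_mask=block_mask);
# the last-token hidden state of each candidate suffix would then feed a linear
# head producing one logit per candidate for cross-entropy classification.
block_mask = create_block_mask(suffix_mask, B=None, H=None,
                               Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device="cpu")
```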
Successful approaches in this competition shared several key characteristics:
- Robust validation strategies such as ensembling, k-fold cross validation, and multi-seed validation. For example, the first-place winner noted that multi-seed validation was “the most critical aspect of this competition” for handling noisy labels.
- Advanced inference engineering, including quantization (reducing model precision to 4-bit) and multi-stage inference, to deploy multiple higher-parameter models (32B to 72B) within the competition’s time constraints. For example, the third-place winner ran expensive 72B models only on low-confidence predictions (a sketch of this pattern appears after this list).
- Data augmentation with LLM-generated synthetic data (for example, soft labels and an expanded set of misconception descriptions) to boost model performance. For example, the second-place team generated 80K synthetic examples using commercial LLMs and expanded brief misconception codes into detailed descriptions.
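The sketch below illustrates how 4-bit quantization and confidence-gated multi-stage inference could be combined, assuming the Hugging Face transformers and bitsandbytes libraries; the model name, confidence threshold, and routing helper are illustrative assumptions, not details of any team’s actual pipeline.

```python
# Hypothetical sketch of 4-bit loading plus confidence-gated multi-stage inference.
# The model name, threshold, and routing helper are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

large_model_name = "Qwen/Qwen2.5-32B-Instruct"   # placeholder 32B-class model
tokenizer = AutoTokenizer.from_pretrained(large_model_name)
large_model = AutoModelForCausalLM.from_pretrained(
    large_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

CONFIDENCE_THRESHOLD = 0.9  # illustrative cutoff

def route(items, first_stage_preds, large_model_predict):
    """Confidence-gated routing: keep the cheaper first-stage prediction when it
    is confident, and re-score only low-confidence items with a callable that
    wraps the quantized large model loaded above."""
    final = []
    for item, (label, confidence) in zip(items, first_stage_preds):
        final.append(label if confidence >= CONFIDENCE_THRESHOLD
                     else large_model_predict(item))
    return final
```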
These strategies addressed the competition’s core challenges: noisy and ambiguous labels, computational constraints, and data imbalance across misconception types. By combining stable validation methods with efficient inference and enriched training data, top teams achieved high MAP@3 scores, meaning their models predicted the correct annotation on the first attempt in the vast majority of cases.
Limitations were those commonly seen in competitive machine learning settings:
- Significant training compute requirements
- Significant inference time requirements
- Optimized for competition-specific features
The top solutions all utilized an ensemble of high-performance 32-billion and 72-billion parameter models, with quantization or layer-wise inference used to work within computing power constraints. This meant that 1) the fine-tuning process typically required high-end GPUs such as the A100 or RTX 6000, and 2) significant inference time was required (e.g., a single component of a 6-component ensemble took up to 190 minutes for inference). Additionally, teams could limit predictions to only the subset of misconceptions previously observed for a question, which may not transfer to new math content where misconception patterns are less established. Such limitations are typical of competitive machine learning environments, where participants appropriately focus on maximizing scores within defined constraints rather than ensuring transferability to varied deployment contexts.
Findings and Results
Top submissions achieved high accuracy in classifying common misconceptions and errors, as well as correct explanations. The top models achieved scores above 0.948, well above the competition’s high-quality benchmark of 0.75, reflecting both the strength of the field and continued progress in applying AI to math education. Additionally, the top 1,000 competitor models all achieved scores above 0.94, increasing confidence in the models’ ability to perform.
Lessons Learned & Impact
The MAP – Charting Student Math Misunderstandings competition demonstrates the potential for data-driven approaches to empower teachers to identify and address gaps in student understanding. By using tools that integrate the competition’s models into diagnostic assessments, educators can pinpoint whether a student’s difficulty arises from a minor procedural mistake or one of several deeper conceptual misunderstandings. This enables targeted feedback and timely interventions, ensuring that misconceptions are corrected before they become entrenched. In turn, students receive support for their specific challenges, improving learning outcomes.
A key contribution of the competition was the development and use of a taxonomy that distinguishes among different common errors in grades 4 to 8. By differentiating between procedural and conceptual errors, educators and researchers can interpret mistakes with greater nuance. Districts and schools can integrate this taxonomy into math teacher professional development programs, embedding it in both pre-service and in-service training. Using these classifications, teachers can better prioritize instructional responses, whether by reteaching specific problem-solving steps or by directly addressing why common wrong answers are incorrect. This structured approach not only deepens understanding of how students think about mathematics but also supports more consistent analysis across classrooms and studies.
Finally, the competition’s outcomes point toward a future in which adaptive and automated feedback systems become integral to math instruction. The demonstrated potential of these new models, which categorize student errors to provide immediate support while offering teachers insights to guide instruction, underscores a promising shift in how math can be taught.
Research Implications
The public release of the dataset used in the competition represents a meaningful contribution to open data and serves as a foundation for future research in AI-driven math education. By making this dataset publicly available, researchers and developers can benchmark current models and platforms in their ability to detect and classify student thinking using correct and erroneous ideas.
The dataset also provides a foundation for building future datasets aligned with current educational math standards and assessments, supporting the continued refinement and standardization of misconception categorization. These developments will enable the creation of better diagnostic tools for teachers and more precise feedback for students. As models trained on such data become more sophisticated, they will contribute to improved instructional strategies and stronger learning outcomes, advancing both research and classroom practice. Given the shrinking STEM teacher workforce and the high demands on teachers, such as large class sizes, automated feedback systems could help teachers maximize their time and personalize learning by targeting instructional approaches to students’ specific errors without requiring excessive additional effort.
Future work should expand this effort by incorporating additional grade levels such as elementary or high school and capturing a wider range of misconception types. Increasing the diversity and volume of annotated student work will help test the generalizability of current models and ensure that findings extend across varied learning contexts. Continued expansion and openness in data sharing will be essential for driving innovation, fostering collaboration, and deepening understanding of how students learn mathematics.
Future efforts can be dedicated to exploring the practical deployment and generalization capacities of these high-performing models. The top 1,000 models on the private leaderboard all performed above 0.941, which is only 0.007 points below the winning model. This suggests the potential to find a ‘balanced’ model in the future, one that offers a good tradeoff between accuracy, speed, and reliability. Such a model would be best suited for a production setting, meaning real-world use, where the algorithm must run efficiently and consistently. The competition established the performance ceiling, and moving forward, researchers have a reasonable benchmark for what constitutes strong performance while remaining practical in terms of computational requirements and inference time.
Future research could also assess the generalization capability of these models beyond the competition’s scope. Testing top-performing models on newly annotated math questions using the same schema would reveal which approaches can generalize to questions they were not trained on. This is particularly relevant given that the training set, public test set, and private test set all shared the same 15 questions, which may have allowed for competition-specific optimization that would not transfer to broader applications. Understanding which modeling strategies maintain robust performance across different question sets would be valuable for deploying misconception detection systems in educational settings.
Finally, future research may be able to re-engage the annotated NAEP dataset. While current staffing issues at IES do not presently allow for research or publication of work on the NAEP datasets, the annotation and research on this dataset have already been completed. At a later time, once these issues are resolved, analyses and comparisons between the two datasets, their models, and their efficacy could be conducted and published, providing further insight into student misconceptions and the generalizability of modeling approaches.
Conclusion
The MAP – Charting Student Math Misunderstandings competition demonstrates the power of combining educational research with machine learning to identify and classify student math errors. The high-performing models and resulting dataset highlight how AI can provide actionable insights for teachers, enabling timely, targeted interventions that address both procedural mistakes and deeper conceptual misunderstandings. By producing an open, annotated dataset and advancing methods for automated error detection, the competition lays a foundation for ongoing innovation in adaptive math instruction and research.
Looking ahead, expanding the dataset to additional grade levels and broader misconception types, along with evaluating model generalization beyond the competition, will be critical steps in translating these advances into practical, effective classroom tools. The results illustrate not only the potential for AI-driven diagnostic systems to enhance student learning but also the importance of collaboration between researchers, educators, and technologists in shaping the future of mathematics education.
This case study was written by Jules King, L Burleigh, Kennedy Smith, Scott Crossley, Bethany Rittle-Johnson, Kelley Durkin, Meg Benner, and Ulrich Boser.