As generative AI tools rapidly enter classrooms, questions about how to evaluate and improve them have become more urgent. Andrew Lan, Associate Professor at the University of Massachusetts Amherst, is a researcher focused on developing generative AI methods for education. By designing systems that can realistically mimic student errors, align responses with different ability levels, and generate rapid-cycle feedback, Lan aims to create new ways to test learning technologies and strengthen teaching practice. In this 5 Questions interview, he reflects on the technical and cognitive challenges of simulating student behavior, the role of data in advancing the field, and how AI and learning science can work together to improve educational innovation.
What is the nature of your work?

My work focuses on developing generative AI methods for educational applications. We cover a wide range of scenarios: math problem generation, identifying concepts in programming problems, student modeling in dialogues, personalizing keyword mnemonics for language learning, and training tutor chatbots to maximize learning outcomes.
Most recently, we have focused on developing authentic simulated student agents powered by LLMs. We have developed methods for helping LLMs recognize and mimic student errors, steering LLM output towards different student and tutor profiles, aligning error types with knowledge and ability levels, and building evaluation metrics for how authentic the simulations are.
Why is this work important?
Generative AI-powered simulated student agents, if authentic, can benefit almost all stakeholders in education. We know that generative AI shows strong promise for improving teaching and learning by delivering interactive learning experiences at scale. However, evaluation of new learning technologies struggles to keep pace with their development, since rigorous A/B tests can be slow and costly. Authentic simulated student agents may give educators and learning content designers rapid-cycle feedback: they can test new technology on simulated students first, examine the simulated learning outcomes, and make timely adjustments before starting a randomized controlled trial.
Moreover, with the help of authentic simulated students, instructors can practice their tutoring strategies and find effective feedback mechanisms, parents can better understand learners’ progress and challenges, content designers can develop more targeted content and adjust curricula, and learners themselves can receive personalized learning content and schedules, or even learn through learning-by-teaching activities.
What’s been the biggest surprise so far?
What surprises us is just how challenging this task is. The literature contains traditional, more rules-based methods, which are less flexible than LLM-based methods, while LLM-based methods mostly resort to prompting an LLM to behave according to a pre-defined student profile. We found the task to be challenging both inherently and technically: student behavior is uncertain and inconsistent, which leads to noisy training data for AI algorithms. One might assume that prompting works, but when prompted outputs are compared against what real students actually do and write, they are very far from the truth.
Certain types of student behavior are simply very hard to simulate. For example, it is very difficult to anticipate what errors students will make in open-ended code submissions to programming problems; we found that it is actually harder for LLMs to simulate surface-level syntax errors than deeper logical errors. Another example is tutor-student dialogues, where we found it very hard to anticipate how students might respond to tutors.
Where do you see your work in five years?
From the academic side, we hope to serve as a bridge connecting AI researchers with psychometrics and cognitive science researchers. Our approach to student simulation is to use widely adopted models, such as item response theory, as external controllers that steer LLMs towards behaving like real students. We aim to build models and personalization methods for learning activities that are not limited to rigid problem-solving requiring specific knowledge or skills, but are more open-ended and interactive, requiring skills that are more “generative AI-resilient,” such as creativity, critical thinking, collaboration, and communication. These activities are especially important in an era where students frequently interact with generative AI tools and with each other, setting their own pace and even defining their own learning agenda.
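To make the item-response-theory controller idea concrete, here is a minimal sketch assuming the standard two-parameter logistic (2PL) model; the function names and default parameters are illustrative, not drawn from the lab's actual system. The model gives the probability that a student of a given ability answers an item correctly, which can then gate whether a simulated student should produce a correct or an erroneous response.

```python
import math
import random

def p_correct(theta, difficulty, discrimination=1.0):
    """2PL item response theory: probability that a student with
    ability `theta` answers an item with these parameters correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def simulate_response(theta, difficulty, rng=None):
    """Decide whether the simulated student answers correctly; a full
    system would then prompt an LLM to write a matching correct or
    erroneous solution."""
    rng = rng or random.Random()
    return rng.random() < p_correct(theta, difficulty)
```

In this sketch, an average-ability student (theta = 0) facing an average-difficulty item answers correctly half the time, and the probability rises smoothly with ability, which is what lets a classical psychometric model act as an external controller over otherwise free-form LLM output.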
From the practical side, we hope to create a large repository of tools for different stakeholders in education and for researchers in different fields: AI model and algorithm developers can use our simulated student agents to benchmark how their tools can help student learning, tutors and instructors can use our agents to improve teaching, and students can use our agents as learning companions.
What else should people know?
Data is of utmost importance when it comes to using AI to develop new tools. I often find that many of our ideas require data that is not yet available; the data we use in proof-of-concept papers is often the only thing out there, but it is far from ideal. For example, before the release of ChatGPT, we combined LLMs with typical item response theory and knowledge tracing models to analyze students’ open-ended responses to questions, not just response correctness. However, despite our motivation being deeply rooted in math, where students’ full problem-solving process may give us much more information than whether they solved the problem correctly, no such math dataset existed; we had to use a dataset of open-ended coding problems instead.
The textual statements of questions were not publicly available until Eedi’s NeurIPS 2020 Education Challenge dataset, and even for that data, we had to run OCR to extract text from images. Open datasets are what drive research forward, but unfortunately they remain a bottleneck right now. Efforts such as NSF’s SafeInsights and data released by Eedi, ASSISTments, and Khan Academy are a good step in this direction, though.
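The knowledge tracing models mentioned above can be illustrated with the classic Bayesian knowledge tracing update, which revises an estimate of a student's mastery of a skill after each observed response. This is a minimal sketch of the standard update rule; the slip, guess, and learn parameter values are illustrative defaults, not fitted to any dataset.

```python
def bkt_update(p_know, correct, slip=0.1, guess=0.2, learn=0.15):
    """One Bayesian knowledge tracing step: Bayes-rule posterior on
    mastery given the observed response, then a learning transition."""
    if correct:
        # P(knew the skill | answered correctly)
        posterior = (p_know * (1 - slip)) / (
            p_know * (1 - slip) + (1 - p_know) * guess)
    else:
        # P(knew the skill | answered incorrectly, i.e. slipped)
        posterior = (p_know * slip) / (
            p_know * slip + (1 - p_know) * (1 - guess))
    # Chance the student learned the skill at this opportunity
    return posterior + (1 - posterior) * learn
```

Each correct answer pushes the mastery estimate up and each error pushes it down, so a sequence of responses traces out an estimated learning curve; analyzing full open-ended responses, rather than correctness alone, is what richer datasets would make possible.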
