Designing AI to Understand Every Student: Voice Datasets for More Effective Speech Recognition

The Cutting Ed
  • December 11, 2025
Joon Choi, Daniel Dela Cruz

How could better voice datasets unlock AI-powered learning for students who don’t speak English as a first language?

The Power of ASR in Learning

For millions of students worldwide, limited English proficiency is both a classroom challenge and a significant barrier to opportunity. Research consistently shows that language skills are among the strongest predictors of economic mobility. Immigrants with low English proficiency (LEP) face steep and well-documented disadvantages: approximately 30 percent report language barriers in obtaining or maintaining employment, those with very low proficiency are five times more likely to work in low-skilled jobs, and working-age LEP adults earn 25-40 percent less than their English-proficient counterparts.

Meanwhile, voice AI applications in education, including language learning, represent a growing sector: foundational LLMs are increasingly used as conversational partners, and new language learning applications continue to emerge. In a recent survey conducted by The Learning Agency, over 80 percent of English and language arts teachers said they would consider, or are open to the idea of, using speech recognition to support literacy instruction.

For non-native English speakers, AI-powered Automatic Speech Recognition (ASR) technologies offer a much-needed, scalable way to improve English language access and proficiency. Unfortunately, current ASR systems fall short for these learners.

When ASR Doesn’t Understand You

Currently, state-of-the-art ASR models exhibit systematic bias against non-native English speakers, particularly those from tonal language backgrounds. Models trained primarily on native English speech conform to native speech patterns, and larger models can show even wider performance disparities for speakers who deviate from them. Studies demonstrate that Whisper and similar models perform significantly worse for speakers of Mandarin, Vietnamese, and Thai, with persistent accuracy gaps across tonal and underrepresented accents. As voice AI adoption accelerates in workplace applications, the gap between ASR performance for native and non-native speakers undermines equitable access to AI-powered tools.
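
The disparity these studies describe is usually quantified as a gap in word error rate (WER) between speaker groups. Below is a minimal sketch of such a comparison in Python, assuming you already have reference transcripts and ASR outputs grouped by speakers' first language (L1); the records are invented for illustration, and the open-source jiwer library supplies the WER computation.

```python
# Sketch: comparing ASR word error rate (WER) across speaker L1 groups.
# The records are invented for illustration; jiwer computes WER.
from collections import defaultdict

from jiwer import wer

# (speaker L1, reference transcript, ASR hypothesis)
records = [
    ("English", "the students read the passage aloud", "the students read the passage aloud"),
    ("Mandarin", "the students read the passage aloud", "the students red the passage allowed"),
    ("Vietnamese", "please repeat the last sentence", "please repeat the last sentence"),
]

# Group references and hypotheses by the speaker's first language.
by_l1 = defaultdict(lambda: ([], []))
for l1, ref, hyp in records:
    by_l1[l1][0].append(ref)
    by_l1[l1][1].append(hyp)

# Per-group WER; the gap between native and non-native groups is the bias signal.
for l1, (refs, hyps) in by_l1.items():
    print(f"{l1}: WER = {wer(refs, hyps):.2%}")
```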

Narrowing the Gap in AI-Powered ASR

As ASR tools evolve, developers must account for non-native English speakers in both their tools' pedagogical approach and functional design, through translanguaging and adaptive AI supports, respectively. This starts with better data: better in quality, in scale, and in the level of pedagogical scaffolding enabled by annotations.

Large-scale, High-quality Data

This systematic bias can be mitigated through improved models built upon large-scale, diverse, high-quality training and evaluation data that reflects real-world L2 English speech patterns. Current datasets underrepresent non-native English speakers and lack authentic multi-speaker contexts across different settings. The primary solution is a substantial accented speech recognition corpus featuring multi-turn conversations with L2 English speakers from diverse L1 backgrounds and proficiency levels.
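
What might one record in such a corpus look like? The sketch below outlines a possible metadata schema for multi-turn, accented speech; every field name and value is an illustrative assumption rather than an existing standard.

```python
# Sketch: a possible metadata schema for an accented, multi-turn speech corpus.
# Field names and values are illustrative assumptions, not an existing standard.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker_id: str
    l1_background: str  # speaker's first language, e.g. "Mandarin"
    proficiency: str    # e.g. a CEFR level from "A1" to "C2"
    setting: str        # e.g. "classroom", "tutoring", "workplace"
    turn_index: int     # position within a multi-turn conversation
    audio_path: str
    transcript: str

example = Utterance(
    speaker_id="spk_0042",
    l1_background="Vietnamese",
    proficiency="B1",
    setting="tutoring",
    turn_index=3,
    audio_path="audio/conv_017/turn_03.wav",
    transcript="I am not sure what this word mean",
)
```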

Translanguaging and Codeswitching

Translanguaging is a teaching approach where educators blend the learner’s native language with the target language. For example, a tutor might explain a challenging grammar rule in Spanish before switching back to English for practice, helping learners grasp complex ideas while reducing anxiety and increasing confidence. While humans naturally do this, AI systems do not.

Spoken communication makes translanguaging even more important. People switch languages more fluidly in conversation than in writing, meaning that voice-based datasets are essential for training AI tutors to support authentic language-learning interactions. Without these datasets, AI tools miss the cues that signal when a learner needs clarification, translation, or reassurance in their native language.
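
To make those cues learnable, a corpus has to mark where the language actually switches. The sketch below shows one simple way to annotate code-switch spans in a tutoring transcript; the span convention (token ranges tagged with a language code) is a hypothetical one chosen for clarity.

```python
# Sketch: annotating code-switch spans in a bilingual tutoring transcript.
# Spans are (start_token, end_token, language) with an exclusive end index;
# this convention is a hypothetical one chosen for clarity.
utterance = "okay entonces el verbo goes at the end of the sentence"
tokens = utterance.split()

# Hand-labeled spans: "entonces el verbo" is Spanish for "so the verb".
spans = [(0, 1, "en"), (1, 4, "es"), (4, 11, "en")]

for start, end, lang in spans:
    print(lang, "->", " ".join(tokens[start:end]))
```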

Adaptive Voice AI Support

Another core limitation of current ASR systems is their lack of real-time adaptation. Research shows that slower speech rates, strategic pausing, and learner control over pacing can improve listening comprehension, though optimal speeds vary by context and excessive slowness may harm understanding. Other studies find no comprehension benefits from slower speech, highlighting the need for personalized, context-aware adaptation rather than one-size-fits-all approaches.

For AI systems to provide personalized support, they need annotated voice datasets that capture when and how non-native English speakers signal confusion, when they benefit from simpler phrasing, and when strategic language switching is most effective.
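
As a concrete illustration, annotations like these could drive a simple adaptation policy at inference time. The signal labels and responses below are invented for this sketch and do not come from any published dataset.

```python
# Sketch: mapping annotated learner signals to adaptive tutoring moves.
# Both the signal labels and the policy are invented for illustration.
ADAPTATIONS = {
    "long_pause": "slow down and re-ask the question",
    "repeated_word": "rephrase using simpler vocabulary",
    "l1_fallback": "respond briefly in the learner's L1, then return to English",
    "clarification_req": "repeat the last sentence at a slower rate",
}

def adapt(signal: str) -> str:
    """Return an adaptive tutoring move for an annotated confusion signal."""
    return ADAPTATIONS.get(signal, "continue at the current pace")

print(adapt("l1_fallback"))
# -> respond briefly in the learner's L1, then return to English
```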

Better Datasets for More Equitable Learning

Researchers can greatly improve ASR for non-native speakers through large-scale, diverse training data, while smaller, richly annotated datasets can drive research into pedagogical supports such as translanguaging and adaptive features in ASR applications.

Fortunately, these components can be merged efficiently into a single, comprehensive annotated resource: a large-scale accented speech corpus provides the foundation, while the same learner conversations can be annotated for both translanguaging moments and adaptive support needs. This unified approach enables researchers to evaluate ASR systems holistically, and could lead to breakthroughs in ASR-backed educational technologies that improve learning no matter what language a learner grew up speaking.
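
Concretely, a single corpus record might carry all three layers at once. The fields below are hypothetical, but they show how the ASR-evaluation, translanguaging, and adaptive-support annotations sketched above could coexist in one resource.

```python
# Sketch: one record combining all three annotation layers.
# Field names are assumptions for illustration; no standard schema is implied.
record = {
    "audio_path": "audio/conv_017/turn_03.wav",
    "speaker": {"l1": "Mandarin", "proficiency": "B1"},
    "reference_transcript": "can you explain this word again",
    "asr_hypothesis": "can you explain these word again",  # for WER evaluation
    "translanguaging_spans": [],                           # code-switch annotations
    "adaptive_signals": ["clarification_req"],             # confusion-signal labels
}
```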

With the rise of AI technology in classrooms, there is an opportunity to significantly expand access to high-quality language learning and improve economic opportunities for non-native English-speaking learners. However, the data used to train ASR models must be diverse and free from bias. By investing in these datasets, developers can build ASR technologies that not only recognize every student's voice, but also respond to their needs in real time. With thoughtful data design and inclusive modeling, AI-powered learning can support all students.

Joon Suh Choi

Data Scientist

Daniel Dela Cruz

Program Associate
