The Learning Agency

Can AI Hear What a Child is Really Saying?

The Cutting Ed
  • August 18, 2025
Amy Quarkume, Dorothy Oteng

In a quiet therapy room, a child mouths the word “ball.” The sound that emerges may be partial, just “ba” or a gust of breath. To a trained speech language pathologist, that attempt is rich with information: how the child moves their mouth, the sounds they are trying to make, and clues about how their speech is developing.

Now imagine that moment through the lens of a machine. Can AI-powered real-time voice agents, systems that listen to spoken language and respond immediately, detect what was really said, not just what they expected to hear?

That question is becoming increasingly relevant. As more educators and clinicians use these agents to guide children through speech exercises or to assist with note-taking during sessions, researchers and developers are grappling with how to adapt them for children, especially those with speech delays.

These models, originally designed for fluent adult speakers, must now work with speech that is not always clear, includes a range of accents, and often happens in noisy environments.

Recent efforts in the field offer insight into what is possible, what remains challenging, and how thoughtful orchestration of voice technologies might help children feel more heard, literally and figuratively.

Reimagining Feedback Through A Familiar Voice

At-home practice is a cornerstone of early language development, especially for young children working to overcome speech delays. But many families struggle to carry over exercises from therapy into daily life. One promising approach involves using voice cloning to recreate a caregiver’s voice, then embedding it into interactive activities designed to support speech goals.

This strategy isn’t simply about personalization for its own sake. Research shows that familiar voices, especially those of trusted adults, increase engagement and emotional receptivity in children. When a child hears prompts or encouragement in their parent’s voice, they may be more likely to repeat words, mimic pronunciation, or stay focused on an activity.

To implement this, parents record a short audio sample and upload it to a voice cloning tool that creates a digital version of their voice with a similar tone and rhythm. The resulting voice is then used to guide the child through speech practice, offering encouragement or prompts in a warm and familiar tone.
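The workflow above can be sketched in code. The function names, signatures, and voice IDs here are purely illustrative stand-ins for whatever cloning service is used (such as Cartesia or ElevenLabs), not a real API:

```python
from dataclasses import dataclass

# Hypothetical sketch of the caregiver voice-cloning workflow.
# `clone_voice` and `synthesize` are illustrative stubs, not a real
# provider API; a production system would call the service's endpoints.

@dataclass
class ClonedVoice:
    voice_id: str
    caregiver: str

def clone_voice(caregiver: str, sample_wav: bytes) -> ClonedVoice:
    """Upload a short audio sample and receive a reusable voice handle."""
    # A real implementation would POST the sample to the cloning service.
    return ClonedVoice(voice_id=f"voice-{caregiver.lower()}", caregiver=caregiver)

def synthesize(voice: ClonedVoice, prompt: str) -> bytes:
    """Render a practice prompt in the caregiver's cloned voice."""
    # Placeholder: a real call would return audio bytes from the TTS service.
    return f"[{voice.voice_id}] {prompt}".encode()

# Guide the child through a short word list in the parent's voice.
voice = clone_voice("Mom", sample_wav=b"...short recording...")
audio = b""
for word in ["ball", "dog", "cup"]:
    audio = synthesize(voice, f"Can you say '{word}'? Great try!")
```

The key design point is that the caregiver records once; the resulting voice handle is then reused across every exercise, so new prompts can be generated without new recordings.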

Different voice tools come with their own strengths and weaknesses. Open research models like VoiceCraft are easier for developers to explore and adapt, but they still fall behind more advanced tools like Cartesia or ElevenLabs when it comes to sounding natural and expressive. The newest voice tools are designed to respond quickly and speak in a smooth, lifelike manner, which is important for keeping children engaged during speech activities.

Creating a smooth voice interaction with a child takes more than just linking together speech recognition, thinking, and spoken replies. These real-time conversations also need to be carefully timed, with clear audio and natural cues so the back-and-forth feels easy and engaging.

Coordinating the Conversation


One promising setup uses a simple system behind the scenes to manage how the different parts of the voice interaction work together:

  • A fast speech-to-text engine for transcription (e.g., Deepgram).
  • A low-latency language model for generating responses (e.g., Gemini Flash).
  • A natural-sounding text-to-speech engine for reply synthesis (e.g., Cartesia).
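The three stages above can be sketched as a simple asynchronous pipeline. Each stage here is a stub standing in for the real service client (Deepgram, Gemini Flash, Cartesia); only the orchestration pattern is the point:

```python
import asyncio

# Minimal sketch of one conversational turn in an STT -> LLM -> TTS
# pipeline. The three stage functions are stubs, not real API clients.

async def speech_to_text(audio: bytes) -> str:
    return audio.decode()        # stub: treat the "audio" as its transcript

async def generate_reply(transcript: str) -> str:
    return f"Nice try saying '{transcript}'! Let's say it together."

async def text_to_speech(reply: str) -> bytes:
    return reply.encode()        # stub: treat the text as "audio" bytes

async def handle_turn(audio: bytes) -> bytes:
    """One turn: child audio in, synthesized reply audio out."""
    transcript = await speech_to_text(audio)
    reply = await generate_reply(transcript)
    return await text_to_speech(reply)

reply_audio = asyncio.run(handle_turn(b"ba"))
```

Because each stage is awaited, a real implementation can stream partial results between stages, which is how these systems keep the full turn under the sub-second budget described below.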

The whole conversation cycle, from when a child speaks to when the system responds, happens in less than one second. This quick response helps the interaction feel smooth and natural, like a real back-and-forth exchange. Behind the scenes, a system like PipeCat helps manage the conversation. It listens for when the child has finished talking, avoids getting confused by its own voice, and knows what to do if something goes wrong. It can also trigger things like animations or games based on what the child says, making the experience more fun and engaging.

These systems are not just technically sophisticated. They are purpose-built to handle speech from children with a wide range of fluency levels. Voice activity detection helps the system know when a child has finished speaking. Echo suppression ensures that the model does not confuse its own voice output with a user’s input, an essential distinction when using synthesized parent voices.
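As an illustration of the end-of-utterance detection mentioned above, here is a deliberately simple energy-based detector. Production systems use trained VAD models rather than raw thresholds; the threshold and frame count here are arbitrary assumptions:

```python
# Toy energy-based end-of-utterance detector, a simplified stand-in for
# the voice activity detection described above. Real systems use trained
# VAD models; the threshold values here are arbitrary.

def utterance_ended(frame_energies: list[float],
                    silence_thresh: float = 0.01,
                    trailing_silent_frames: int = 5) -> bool:
    """Return True once the last N frame energies fall below the threshold."""
    if len(frame_energies) < trailing_silent_frames:
        return False
    return all(e < silence_thresh
               for e in frame_energies[-trailing_silent_frames:])

# Speech (high energy) followed by a pause (near-zero energy):
energies = [0.4, 0.5, 0.3] + [0.001] * 5
print(utterance_ended(energies))   # True: the child has stopped speaking
```

Waiting for several consecutive quiet frames, rather than reacting to a single one, is what keeps the system from cutting a child off mid-word during a natural pause.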

These tools are also designed to protect children’s voice data and meet strict healthcare privacy standards.


Unanswered Questions About AI And Speech

Despite technical advances, several foundational questions remain.

Can these systems really learn to recognize what a child actually said, even if it sounds unclear or unexpected? How do developers make sure that tools built for adult speech do not misunderstand the unique ways young children speak? And what kinds of privacy protections are needed when using voice data from children?

There are also important questions about how these tools should support children’s learning. How should systems deliver feedback? Should they correct the child right away, offer a gentle nudge, or simply model the correct sound without saying anything was wrong? Does the child’s emotional response change depending on whether the feedback comes from a computer voice or a familiar caregiver’s voice? And can these systems be designed to reflect different languages and cultures in ways that promote fairness and inclusion? 

Researchers are beginning to explore these questions through qualitative testing, feedback loops with therapists, and side-by-side comparisons of traditional and AI-augmented sessions. Results are still early, but promising.

Designing for Listening

Voice technology is not new. But designing it for children with speech delays is something entirely different. These children need to be listened to with precision, patience, and empathy. It demands not just high-performance models, but thoughtful orchestration and ethical design.

What is emerging from this space is not a replacement for therapists, but a companion system that supports families, frees up professional time, and encourages more frequent practice through emotionally resonant feedback.

In the end, helping a child find their voice requires more than technical fluency. It requires building systems that can listen not just to what is said, but to how it is said, and why it matters.

If AI can begin to do that, it might help ensure that more children are truly heard.

Amy Quarkume

Founder/CEO, Worlds of Hello

Dorothy Oteng 

Chief Research Officer, Worlds of Hello
