The AI Mirror: How GenAI Reflects and Amplifies Gaps in Early Math Expertise

The Cutting Ed
June 11, 2025
Anastasia Betts

As a learning scientist and expert in early mathematics, I’ve spent the past year in an unexpected partnership – teaching AI systems about how young children learn math. First, let me share a little background: my work centers on developing content for PAL.guide, a digital tool that helps parents and teachers support children’s early math development through hands-on activities. Each activity aligns with a specific knowledge unit (KU) and is designed to produce observable evidence of that skill. Adults provide feedback to PAL on whether the child completed the activity independently, with help, or could not do it – data the system uses to adjust each child’s personalized learning trajectory. This level of granularity also supports non-experts: by focusing on just one discrete concept or skill at a time, it becomes easier for parents and teachers to provide accurate, meaningful feedback.
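To make that feedback loop concrete, here is a minimal sketch of how an adult’s report on a single activity might drive the choice of the next knowledge unit. The class names, the three-way outcome, and the advance/stay/step-back policy are illustrative assumptions for this article, not PAL.guide’s actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """How the adult reports the child's performance on one activity."""
    INDEPENDENT = "independent"   # completed the activity on their own
    WITH_HELP = "with_help"       # completed it with adult support
    UNABLE = "unable"             # could not complete it yet


@dataclass
class KnowledgeUnit:
    """One discrete concept or skill in a developmental progression."""
    ku_id: str
    description: str


def next_target(progression: list[KnowledgeUnit],
                current_index: int,
                outcome: Outcome) -> int:
    """Pick the next KU to work on, given the adult's feedback.

    Illustrative policy only: advance on independent success, stay in place
    when help was needed, and step back one KU when the child could not do it.
    """
    if outcome is Outcome.INDEPENDENT:
        return min(current_index + 1, len(progression) - 1)
    if outcome is Outcome.WITH_HELP:
        return current_index
    return max(current_index - 1, 0)
```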

In working with large language models (LLMs) for PAL, I discovered they often present poor math understanding with the polish of expertise. These weren’t one-off errors – they were persistent patterns across platforms (e.g., Claude, ChatGPT, Llama, Mistral, etc.) and math domains. From numeracy to measurement to geometry and beyond, LLMs routinely crossed developmental boundaries, confused concepts, and misaligned activities – often even after being repeatedly corrected.

Why does this matter? Because these AI systems are increasingly becoming the go-to resource for the very educators and parents who need the most support in early math, yet parents and even teachers often lack the specialized knowledge to detect these subtle but critical errors. When a preschool teacher who is struggling with math confidence turns to ChatGPT for help, or when a parent uses AI to generate activities for their 4-year-old, they risk getting flawed content in return, delivered with the very authority and confidence that make these tools so appealing.

Some think using an AI-powered educational app specifically designed for early math would yield more reliable results. Unfortunately, most AI-embedded educational products are simply general-purpose LLMs with an educational-looking interface layered on top. Very few of these ed tech tools have been trained on rigorously vetted, research-based knowledge models for early math development, largely because such datasets barely exist, and when they do, they’re often proprietary and themselves insufficiently rigorous (a problem I have written about extensively here and here). 

The result is a perfect storm: those least equipped to spot mathematical misconceptions are the most likely to implement them, potentially establishing foundational confusions that persist throughout a child’s mathematical development. What’s more troubling is that AI isn’t inventing these problems; it’s reflecting and amplifying the gaps that already exist in our collective understanding of early math. In essence, AI has become a mirror, revealing not machine failure, but deeper human expertise gaps in how we understand, design, and deliver early math learning.

Working with AI: A Case Study About Early Measurement

While the full set of examples from over a thousand hours of work is too extensive to share, one revealing case study highlights the systemic nature of these challenges. In the annotated exchange that follows, I worked with ChatGPT (one of several GenAIs I tested) to generate activities for a seemingly simple measurement concept. The conversation illustrates how AI can confidently propose activities that appear sound yet miss key developmental and mathematical nuances.

The Knowledge Unit (KU) in Focus: Shows awareness of size, and uses appropriate language to describe such as big, small, giant, tiny, etc.

In our knowledge model, this KU is among the earliest in the measurement progression for 4–5-year-olds. Yet as I worked with the AI to develop aligned activities, its responses repeatedly targeted more advanced concepts, despite clear guidance on the intended skill (using simple language to describe size). The exchange revealed how even well-written AI content can subtly undermine early learning by skipping developmental steps in early math progressions.


The Initial Misalignment

When asked to create activities for this basic awareness skill, ChatGPT immediately jumped several developmental steps ahead:

Me: Let’s try the next KU, which is “Shows awareness of size, and uses appropriate language to describe such as big, small, giant, tiny, etc.”

ChatGPT: [First activity suggestion]

Size Sort

Activity Description: Gather household objects of different sizes (e.g., a spoon, a book, a pillow). Help {{learner_name}} sort them into groups by size: big, medium, and small.

This initial suggestion reflects a fundamental misunderstanding: sorting requires a more advanced skill than simply recognizing size. Notably, the GenAI had already reviewed the full KU progression and discussed where this foundational KU fits, well before sorting, classifying, or ordering appear later in the sequence.

Recognizing the Error – But Not Correcting It

Following the AI’s initial response, I pointed out the mismatch between the activity and the intended developmental target:

Me: Let’s analyze how you did. The KU says this: “Shows awareness of size, and uses appropriate language to describe such as big, small, giant, tiny, etc.” The very first activity you suggested involved “sorting” which is a very different depth of understanding than just showing awareness.

ChatGPT: You’re absolutely right! The activity I suggested, “Size Sort,” moves beyond simple awareness and delves into categorizing or sorting, which implies a more developed understanding. For this KU, we should focus on recognition and description of size using language, rather than higher-level tasks like sorting or organizing.

Although ChatGPT acknowledged the error and correctly described the KU’s focus, it continued to produce similar developmental misalignments, showing that recognizing a mistake doesn’t guarantee a change in a GenAI’s output. I responded by re-sharing the full measurement progression to reinforce where this KU sits in the sequence:

Me: I also think it might help to again see the actual progression of KUs in the Measurement category so you know where this one fits in: Shows awareness of size, and uses appropriate language to describe such as big, small, giant, tiny, etc. Compares differences in size (two objects)… Sorts a group of objects by size… Orders objects according to size (e.g., smallest to biggest, etc.)… [the entire progression was provided]

ChatGPT: Seeing the full progression of KUs clarifies the developmental pathway within the Measurement category. Here’s how “Shows awareness of size, and uses appropriate language” fits:

  1. Foundational Skill:
  • This KU is the very first step, laying the groundwork for more complex skills like comparison, sorting, and measuring.
  • The focus is on basic recognition and descriptive vocabulary, without requiring higher-level reasoning or actions.

Despite repeated feedback and access to the full framework, the AI continued to propose activities misaligned with the KU’s intended developmental level.

ChatGPT: Activity 3: Nature’s Comparison

Activity Description: Gather three objects outside: a big stick, a medium stick, and a tiny stick. Ask {{learner_name}} to place them in order from big to small. Then, encourage them to describe the differences: “Which one is the biggest? Which is the tiniest? How do you know?”

Me: OK here’s another problem: “Gather three objects outside: a big stick, a medium stick, and a tiny stick. Ask {{learner_name}} to place them in order from big to small…” While this is a good activity, this is more appropriate for a different KU. If you review the list of KUs I provided to you, you will find the KU that this more directly addresses.

ChatGPT: You’re absolutely correct – this activity aligns more directly with the KU: “Orders objects according to size (e.g., smallest to biggest, etc.)”

Even after repeated correction and explicit developmental framing, the AI continued producing activities aimed at more advanced skills. This wasn’t a misunderstanding of instructions; it was a deeper reflection of how difficult it is, for both humans and machines, to internalize fine-grained developmental progressions in early math. And this was not an isolated case. The same pattern emerged across many domains and KUs: a surface-level fix, followed by another misstep.

AI’s Concept Conflation and Vocabulary Precision Issues

Beyond developmental misalignment, ChatGPT frequently conflated mathematical concepts, such as size and length, and misused vocabulary that’s critical in early math learning:

ChatGPT: [In revised activity] Finding big and tiny objects helps {{learner_name}} focus on size differences and practice using size words. Encourage them to describe what they notice: “The tree is tall and wide, but the flower is very short and small.”

Me: This is a problem now, because you are conflating the use of short and small… short is an aspect of length (which you have seen is in another strand of KUs), and small is an aspect of size. You are saying this to the parent, which feels like we are giving bad or confusing guidance – conflating aspects of size and length, when we are meant to be teaching the parent as well.

ChatGPT: You’re absolutely right again! The example I provided creates confusion by mixing short and small without clarifying their intended use or the nuance between them. This undermines the goal of building precise language skills and mathematical understanding and could confuse both the child and the parent.

The distinction between size words (big/small) and length words (tall/short/wide) is critical. These describe different mathematical attributes that children must learn to differentiate. When “big” and “tall” are treated as interchangeable, several problems arise:

  • Measurement Confusion: Size may refer to overall dimensions, volume, or mass, while length specifically refers to one-dimensional measurement. Each requires a different mental model, tool, and unit.
  • Classification Errors: Math depends on distinguishing attributes clearly. When children confuse size and length, they struggle with categorization – an essential skill for reasoning.
  • Problem-Solving Mistakes: For example, a child may assume a taller container holds more water, even if it’s narrower. Without clear conceptual distinctions, errors become systematic.
  • Intervention Challenges: If a child understands one attribute but not the other, educators must know which concept to support. Conflation makes targeted help nearly impossible.

These distinctions form the foundation for all later mathematical development, including area vs. perimeter in elementary school, volume vs. surface area in middle school, and multi-dimensional measurement in high school geometry and calculus. We’re not just teaching vocabulary – we’re helping children build the mental structures that organize future mathematical thinking. When AI systems blur these distinctions in early learning activities, they risk planting misconceptions that persist and compound over time.
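One way this distinction could be operationalized when reviewing AI-generated activities is a simple vocabulary check that flags parent-facing guidance mixing size words with length words when the target KU is about size alone. The word lists and function below are illustrative assumptions, not a production tool, and a flag only signals that a human reviewer should take a closer look.

```python
import re

# Illustrative word lists; a real knowledge model would be far more complete.
SIZE_WORDS = {"big", "small", "giant", "tiny", "huge", "little"}
LENGTH_WORDS = {"tall", "short", "long", "wide", "narrow"}


def flags_attribute_conflation(activity_text: str) -> bool:
    """Return True if text aimed at a size-awareness KU also uses length vocabulary.

    A hit does not prove the activity is wrong, only that size and length may
    be treated as interchangeable and should be reviewed.
    """
    words = set(re.findall(r"[a-z]+", activity_text.lower()))
    return bool(words & SIZE_WORDS) and bool(words & LENGTH_WORDS)


# The example from the exchange above mixes "tall/wide/short" with "small":
print(flags_attribute_conflation(
    "The tree is tall and wide, but the flower is very short and small."
))  # True
```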


Patterns of Misunderstanding: What GenAI Reveals About Us

Throughout this project, I encountered a wide range of persistent mathematical misunderstandings. In number sense alone, GenAIs frequently confused verbal count sequence with one-to-one correspondence, suggested backward counting when asked for forward skip counting by 10s to 50, and offered numeral recognition activities when the KU targeted oral counting. They also conflated part-part-whole relationships with conservation of number in different arrangements, or asked learners to write numbers from memory (e.g., “close your eyes and draw the number in the air”) despite the KU explicitly calling for guided tracing support.

Problems emerged consistently across domains and contexts. In geometry, GenAIs often confused different types of symmetry, proposing activities focused on pattern copying (translation symmetry) when the goal was reflection symmetry. They consistently blurred conceptual lines between subdomains of measurement, including conflating attributes of length and size (as shown in the case study), but also weight and capacity. When prompted to design activities that were authentically situated in the lived experiences of young children, the AI frequently defaulted to clichés (e.g., crackers and blocks) or created impractical scenarios. One particularly revealing example involved an activity requiring 50 to 100 small food items during a routine meal, completely disconnected from the realities of family life.


These issues weren’t isolated to a single KU, activity, AI system, or mathematical domain. Across multiple GenAIs, the same misunderstandings appeared repeatedly, mirroring the kinds of confusion that already exist in early childhood education.

The Triple Expertise Gap

As I’ve reflected on my work with LLMs, I’ve come to recognize what I call the “triple expertise gap” in early mathematics – a cascading problem that begins with misconceptions about the complexity of early math and extends through the systems we rely on to support learning.

| Gap | What it is | How it involves expertise | Consequences |
|---|---|---|---|
| Practitioner Gap | Many early childhood educators lack deep, domain-specific knowledge of early math | A direct gap in practitioner expertise caused by limited training and support | Misaligned instruction; reliance on AI tools without the ability to spot errors |
| Perception Gap | Early math is widely perceived as “simple” or “intuitive” | A societal undervaluing of early math expertise that leads to underinvestment | Weak professional development systems, lack of research-based curriculum, poor training data for AI |
| Infrastructure Gap | There are too few true specialists in early math to build and scale high-quality tools and resources | A gap in disciplinary and systems-level expertise | EdTech companies and AI developers unintentionally build tools that replicate human misunderstandings |

The first gap is foundational: most early childhood educators lack consistent, domain-specific knowledge in early mathematics. This isn’t a failing of individual educators – it’s a systemic failing that spans teacher preparation, professional development, and educational priorities. In a recent study, preschool teachers showed uneven accuracy in identifying age-appropriate math skills for 4-year-olds: 66 percent for numeracy, 70 percent for patterning, and 73 percent for spatial and geometric thinking. Contrary to common assumptions, numeracy, not geometry, was the weakest area. These disparities reflect what systems prioritize: only 25 percent of teachers reported receiving substantial professional development in early math. In another critical study, pre-K teachers’ judgments were less accurate for math skills than for early literacy skills, with only 66 percent agreement between teacher ratings and direct assessments of children’s math abilities, versus 71 percent agreement for language and literacy. This suggests that pre-K teachers may be less attuned to their students’ developing math competencies than to their literacy skills.

The second gap reinforces the first: the widespread belief that early math is “simple” leads to chronic underinvestment in curriculum and teacher preparation. When adults see young children naming shapes or counting to 10, they often think, “How hard can this be?” But these foundational skills – like one-to-one correspondence or spatial reasoning – form the cognitive architecture for all later mathematics. Misunderstandings at this level don’t vanish; they compound over time, creating learning challenges that are much harder to remediate.


What makes this especially problematic is that surface-level fluency often hides deeper conceptual gaps. A child might count to 20 or name shapes without truly understanding what those numbers represent or how shapes relate to one another. These gaps tend to emerge later, when advanced concepts reveal the fragile foundation underneath. And without specialized training, adults often reinforce the problem, celebrating rote counting without comprehension or rushing into numeral writing before children are spatially ready. Each subtle misunderstanding becomes a future roadblock.

The third gap completes the cycle: true expertise in early math is rare. Unlike early literacy, which benefits from decades of research and training infrastructure, early math remains a niche specialization. This scarcity extends beyond classrooms, into the AI systems now being trained on our fragmented and imprecise knowledge base.

When GenAI suggests an activity that conflates size and length – describing a flower as “short and small” or a tree as “tall and big” – it’s not just making a technical mistake. It’s echoing the same conceptual confusions that arise when humans lack foundational expertise. “Short” refers to linear measurement; “small” refers to overall size. These distinctions require entirely different mental models, and they matter deeply for later topics like area vs. perimeter or volume vs. surface area.


What makes this problematic is that both humans and AI systems deliver guidance with equal confidence. That authority masks misconceptions. A flawed activity, presented clearly and confidently, often passes as credible, even when it disrupts the developmental logic children need to build understanding.

In the age of AI, this becomes a perfect storm. Those least confident in early math are the most likely to turn to GenAI for help – and the least likely to spot its mistakes. Because AI is trained on flawed human knowledge, it doesn’t just reflect these gaps – it reinforces and scales them. Unless we invest in better foundational infrastructure, we are teaching our machines to mirror our misunderstandings – and they’re teaching them back to us.

The Challenge and the Opportunity

As generative AI tools become more embedded in everyday educational practice, parents and teachers, especially those with the least training in early math, are increasingly relying on them for support. Most won’t be building custom GPTs or fine-tuned models. They’ll use what’s available off the shelf. And, troublingly, so will many ed tech companies, often unaware that they are training their tools on flawed foundations that reproduce long-standing misconceptions in early math.

This is the challenge. But it’s also an opportunity.

The AI mirror doesn’t just reveal gaps in machine understanding; it exposes deep cracks in our human knowledge infrastructure. Through the field of learning engineering, we’re uniquely positioned to address this. We can build high-quality, research-grounded datasets and knowledge models that reflect what we actually know about how children learn math. We can design tools that validate AI-generated content against research-grounded developmental progressions. And we can strengthen the feedback loops between research, classroom practice, and AI training.
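As a sketch of what such a validation tool might look like, the snippet below encodes a small fragment of a measurement progression and flags a proposed activity whose keywords belong to a later KU than the one being targeted. The KU identifiers, keyword lists, and matching rule are hypothetical; a real checker would draw on a vetted, research-grounded knowledge model rather than hand-written keywords.

```python
# Fragment of a measurement progression, ordered from earliest to latest.
# KU ids and keyword lists are hypothetical, for illustration only.
PROGRESSION = [
    ("size_awareness", {"big", "small", "giant", "tiny"}),
    ("size_comparison", {"bigger", "smaller", "compare"}),
    ("size_sorting", {"sort", "group", "categorize"}),
    ("size_ordering", {"order", "smallest to biggest", "arrange"}),
]


def check_alignment(activity_text: str, target_ku: str) -> list[str]:
    """Return ids of later KUs whose keywords appear in the activity text."""
    text = activity_text.lower()
    target_index = [ku for ku, _ in PROGRESSION].index(target_ku)
    flagged = []
    for ku, keywords in PROGRESSION[target_index + 1:]:
        if any(kw in text for kw in keywords):
            flagged.append(ku)
    return flagged


# The "Size Sort" suggestion, checked against the size-awareness KU:
print(check_alignment(
    "Help the learner sort them into groups by size: big, medium, and small.",
    "size_awareness",
))  # ['size_sorting']
```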

If we want AI to be part of the solution in early math education, we must first ensure that it is learning from the best data we can provide, even if we have to newly create those datasets. That means investing in the infrastructure, human and technical, that will guide machines and educators toward deeper, more accurate understanding. This is not just a call to fix AI. It’s a call to fix what AI has revealed: the urgent need for better shared knowledge about how young children build their earliest mathematical foundations.

Anastasia Betts

Executive Director, Learnology Labs
