Large Language Models Need Help To Do Math

The Cutting Ed
May 31, 2024
Daniel Jarratt

In current artificial intelligence research, there’s growing excitement about the potential to revolutionize math education using new tools like Khan Academy’s Khanmigo. These tools promise personalized learning experiences and innovative teaching methods, heralding an education “revolution.” However, a fundamental challenge arises from the fact that the most popular A.I. technologies, notably large language models (LLMs), struggle with many mathematical tasks. To truly harness the power of A.I. in math education, it’s crucial to understand these limitations and how scientists and engineers have successfully created LLM interfaces to calculator-style systems.

Understanding LLMs and Their Limitations

Large language models are currently the most popular component of A.I. architectures because they allow easy natural-language interaction with the computer. LLMs and large multimodal models (which also consider other types of information, like audio, video, and images) are statistical systems, a family of computation methods fundamentally different from the kind of computation used in formal mathematics. Because we generally want A.I.-powered systems like chatbots to be able to perform calculations, it’s important to understand why that won’t happen directly with today’s model training methods.

LLMs and related foundational models are built on vast datasets of human language and rely on statistical patterns to generate responses and predictions. Algorithms mine this training data for correlations between small data elements (called “tokens”) and then store these patterns in large data structures called models. This means the models’ knowledge is limited to what they saw in the training dataset, which is true of all statistical models. Their outputs are predictions or samples drawn from the underlying statistical model, answering the question “what’s the next most likely output token given the current input?” This is appropriate for “fuzzy” outputs such as human language, in which every word has a highly complex set of meanings that vary according to context and over time. Language is constantly constructed through the act of communication, and its truthfulness and usefulness are based on whether the communication was successful. Language models are good at capturing this.
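
To make “the next most likely output token” concrete, here is a deliberately tiny, purely illustrative sketch in Python: a toy bigram counter, not any real LLM architecture. It can only reproduce patterns present in its training text.

```python
import random
from collections import Counter, defaultdict

# Toy bigram "model": count which token follows which in a tiny training
# text, then sample continuations from those counts. This is a caricature
# of an LLM, but it shows why output is pattern lookup, not computation.
training_text = "two plus three equals five . two plus two equals four .".split()

counts = defaultdict(Counter)
for current_token, next_token in zip(training_text, training_text[1:]):
    counts[current_token][next_token] += 1

def predict_next(token: str) -> str:
    """Sample a continuation in proportion to how often it followed `token`."""
    followers = counts[token]
    candidates, weights = zip(*followers.items())
    return random.choices(candidates, weights=weights, k=1)[0]

print(predict_next("plus"))    # "three" or "two": whatever the data contained
print(predict_next("equals"))  # "five" or "four": a lookup, not arithmetic
```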

LLMs showed impressive early math results because the initial training data included some math. However, statistical relationships between sequences of tokens are not the same as doing math. There is a fundamental distinction between predicting the next token in a sequence and computing the result of a problem, which means that today’s foundational models cannot do math formally without help.

Doing Computational Mathematics

Mathematics is built on logical principles that are provable, starting from the most basic and then building and combining principles into axiomatic systems. Much of math is deterministic (that is, it doesn’t consider or need uncertainty) because of how its elements are constructed. 2 + 3 = 5 not because we’ve seen that string of characters many times before, but because it is a statement that can be formally proven and will never vary, even if you provide a million claims to the contrary. Computers work this way because they use binary information and algorithms to perform functions (e.g., multiplication or division by shifting bits), and each algorithm must be formally shown to be correct. Mathematicians and computer scientists call these demonstrations “proofs.” That’s how we know computers and calculators will do what we design them to do, every time. Computational mathematical systems, also called symbolic systems, use this type of logic to solve math problems. You can use a symbolic system right now at Wolfram|Alpha. The outputs you receive will not be predictions. They will be solutions.
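
To see the contrast in practice, here is a minimal sketch using SymPy, an open-source Python computer algebra system, as a stand-in for a symbolic engine like the one behind Wolfram|Alpha. Nothing in it depends on training data.

```python
import sympy as sp

# Symbolic computation: results are derived by applying rules, not predicted.
x = sp.symbols("x")

# Solve x**2 - 5*x + 6 = 0 exactly; the roots follow from algebra, every time.
print(sp.solve(sp.Eq(x**2 - 5 * x + 6, 0), x))  # [2, 3]

# 2 + 3 equals 5 because it is provable, not because the string is common.
print(sp.Integer(2) + sp.Integer(3))             # 5

# Exact rational arithmetic, with no rounding and no sampling involved.
print(sp.Rational(1, 3) + sp.Rational(1, 6))     # 1/2
```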

Why are these different? Mathematics and statistics are like languages in many ways; you can use their symbols and rules to communicate, build, and deduce. They are expressive. They expand your capabilities. However, just as human language is vast but an individual book is narrow, statistics is vast but an individual statistical model is narrow. Models are necessarily reductive; each is a single lens through which to see the universe. Even though large models can capture many useful patterns, they focus on the most prominent relationships they see, and they always depend on training data and algorithm choices.

Because mathematics can express essentially infinite and arbitrarily complex statements (it is a language and system with which we can construct anything), asking LLMs to do math means asking them to make statistical predictions for hard non-statistical problems that no human has ever thought about, much less written down at scale for LLM training. Neither the data nor the algorithm can do this. Computer scientists are progressing toward improved reasoning skills in LLMs, and the strong relationship between reasoning and mathematics has been helpful for math-related outputs, but this has not yet produced the level of quality required for mathematics in high-precision domains like education. We cannot count on language or multimodal models alone for provably correct deductive inferences to arbitrarily hard problems (read more on this topic here).

LLMs Belong in Education

This remains a worthwhile area of research: LLM capabilities are useful around math because people use human language with math all the time. LLMs offer specific language-based advantages that have the potential to significantly improve math education. These include individualization of curriculum, feedback, and assessment items, such as creating engaging materials in students’ primary languages and situating those materials within students’ individual educational pathways. Chatbots can be quite friendly and offer supportive messages to math students, and because they are machines, chatbots offer just-in-time support even when human help is not available.

But to implement an A.I. system in a math education setting, its math outputs must be verifiably correct, just as math textbooks go through quality control processes. If LLMs do produce high-quality instructional materials, then we can leverage these capabilities to generate step-by-step explanations, Socratic tutoring, unlimited practice problems, tailored feedback, and error pattern identification, among many other potential use cases. But because LLMs do not produce high-quality instructional materials or correct mathematics out of the box, the education and research community can work with technologists to ascertain which of these uses are possible.
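
One plausible quality-control step, sketched below, is to check any model-generated answer against a symbolic solver before it ever reaches a student. The function name and problem format here are illustrative assumptions, not a description of any particular product.

```python
import sympy as sp

def verify_claimed_solution(equation_text: str, claimed_answer: str) -> bool:
    """Check a (hypothetical) model-generated answer against a symbolic solver.

    `equation_text` looks like "x**2 - 5*x + 6 = 0" and `claimed_answer`
    looks like "2". This is a sketch of one possible quality-control step,
    not a full grading pipeline.
    """
    x = sp.symbols("x")
    lhs_text, rhs_text = equation_text.split("=")
    equation = sp.Eq(sp.sympify(lhs_text), sp.sympify(rhs_text))
    solutions = sp.solve(equation, x)
    return sp.sympify(claimed_answer) in solutions

# Only surface an answer to students if the symbolic check passes.
print(verify_claimed_solution("x**2 - 5*x + 6 = 0", "2"))  # True
print(verify_claimed_solution("x**2 - 5*x + 6 = 0", "4"))  # False
```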

Combining LLMs and Symbolic Systems for Enhanced Math Education

The key to enhancing A.I. for math education lies in combining the strengths of LLMs and symbolic systems. While LLMs can offer personalized learning experiences, feedback, and engagement through human language interaction, symbolic systems provide the foundation for accurate mathematical computations and logical reasoning. Integrating these components can lead to the development of neurosymbolic systems, which blend language understanding with mathematical precision.

The good news is that A.I. systems are already modular; large foundational models are only a portion of what makes A.I. work. Systems like ChatGPT also include adjustments that make model outputs appropriate for dialogue (which is how chatbots are made), trust and safety protections, knowledge integration from other sources, personalization, memory and other context information, and performance optimizations. Advanced foundational models don’t have to do all the work! They’re a necessary but not sufficient part of modern A.I. It turns out that we can add calculators (which are correct and cheap) to complement LLMs (which are often wrong and currently more expensive). Symbolic systems and neural systems (another name for the big statistical language models) come together in neurosymbolic systems. A common path is to train LLMs to output computer code, such as Python or the Wolfram Language, execute that code in a protected environment, and return the results to the LLM to display to the user. There is still a math correctness risk, since the conversion of human language or queries into symbolic code remains probabilistic, but it’s far easier to see how the symbolic system solved the math problem and debug potential issues. One of the very first plugins for ChatGPT came from Wolfram Research, bringing the symbolic capabilities of Wolfram|Alpha to OpenAI’s chatbot.
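
Here is a minimal sketch of that pattern, with two loud assumptions: `call_llm` is a hypothetical placeholder for a real model API call, and SymPy’s parser with a restricted namespace stands in for a genuinely sandboxed execution environment. The point is the division of labor: the neural step turns language into code, and the symbolic step computes the exact result.

```python
import sympy as sp

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: a real system would call a model API here.
    Pretend the model has translated the question into symbolic code."""
    return "Rational(12, 100) * 250"

def run_symbolic_tool(expression: str) -> str:
    """Execute the model-written expression with a symbolic engine instead of
    trusting the model's own arithmetic. A restricted namespace stands in for
    a real sandbox."""
    allowed = {"Rational": sp.Rational, "Integer": sp.Integer, "sqrt": sp.sqrt}
    result = sp.sympify(expression, locals=allowed)
    return str(sp.simplify(result))

question = "What is 12% of 250?"
generated_code = call_llm(question)         # neural step: language -> code
answer = run_symbolic_tool(generated_code)  # symbolic step: code -> exact result
print(f"{question} -> {answer}")            # What is 12% of 250? -> 30
```

If parsing fails or the result looks wrong, the system can ask the model to regenerate its code, which is far easier to debug than an opaque wrong answer embedded in prose.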

We should continue evaluating foundational models based on their mathematics capabilities, but we should not expect their performance to reach perfection on arbitrary math queries. Instead, we should design chatbots and other A.I.-powered systems with neurosymbolic interfaces. We can then evaluate A.I. progress based on how well the LLM takes advantage of its tools. And that’s the most human question of all.

Daniel Jarratt

Chief Technical and Innovation Officer, Learning Engineering Virtual Institute
