Large Language Models Need Help To Do Math

May 31, 2024

Daniel Jarratt

In current artificial intelligence research, there’s growing excitement about the potential to revolutionize math education using new tools like Khan Academy’s Khanmigo. These tools promise personalized learning experiences and innovative teaching methods, heralding an education “revolution.” However, a fundamental challenge arises from the fact that the most popular A.I. technologies, notably large language models (LLMs), struggle with many mathematical tasks. To truly harness the power of A.I. in math education, it’s crucial to understand these limitations and how scientists and engineers have successfully created LLM interfaces to calculator-style systems.

Understanding LLMs and Their Limitations

Large language models are currently the most popular part of A.I. architectures because they allow easy natural language interaction with the computer. LLMs and large multimodal models (which consider other types of information like audio, video, and images) are statistical systems, a family of computation methods that are fundamentally different from the sort of computation used in formal mathematics. Because we generally want A.I.-powered systems like chatbots to be able to make calculations, it’s important to understand why that won’t happen directly using today’s model training methods.

LLMs and related foundational models are built on vast datasets of human language and rely on statistical patterns to generate responses and predictions. Algorithms look through this training data to mine for correlations between small data elements (called “tokens”) and then store these patterns in large data structures called models. This means the models’ knowledge is limited to what they saw in the training dataset, which is true for all statistical models. Their outputs are predictions or samples drawn from the underlying statistical model, answering the question “what’s the next most likely output token given the current input?” This is appropriate for “fuzzy” outputs such as human language, in which every word has a highly complex set of meanings that vary according to context and over time. Language is constantly constructed through the act of communication, and its truthfulness and usefulness are based on whether the communication was successful. Language models are good at capturing this.

LLMs showed impressive early math results because the initial training data included some math. However, statistical relationships between sequences of tokens are not the same as doing math. There is a fundamental distinction between predicting the next token in a sequence and computing the result of a problem. This means that today’s foundational models cannot do math formally without help.
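
To make the distinction concrete, here is a deliberately tiny sketch in Python. It is not how a real LLM works: the "model" below only counts which answer followed each prompt in a made-up training set, and every name in it (TRAINING_TEXT, train, predict, compute) is a hypothetical chosen for illustration. It does show the difference in kind between looking up a likely continuation and computing a result.

```python
from collections import Counter, defaultdict

# Made-up "training data": arithmetic strings the toy model has seen.
# One of them is wrong on purpose, as real text sometimes is.
TRAINING_TEXT = ["2 + 3 = 5", "2 + 3 = 5", "2 + 3 = 6", "1 + 1 = 2"]

def train(lines):
    """Count which final token followed each prompt prefix."""
    counts = defaultdict(Counter)
    for line in lines:
        prompt, answer = line.rsplit(" ", 1)
        counts[prompt][answer] += 1
    return counts

def predict(counts, prompt):
    """Return the most frequently seen continuation, or None if unseen."""
    if prompt not in counts:
        return None
    return counts[prompt].most_common(1)[0][0]

def compute(prompt):
    """Actually add the two operands in a prompt such as '2 + 3 ='."""
    left, _, right, _ = prompt.split()
    return int(left) + int(right)

model = train(TRAINING_TEXT)
print(predict(model, "2 + 3 ="))    # 5, because that string was seen most often
print(predict(model, "17 + 25 ="))  # None, because this prompt was never seen
print(compute("17 + 25 ="))         # 42, computed rather than looked up
```

A real language model generalizes far better than this lookup table, but its answer to an unseen problem is still a prediction shaped by training data, whereas compute returns the same result no matter what the training text claimed.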

Doing Computational Mathematics

Mathematics is built on logical principles that are provable, starting from the most basic and then building and combining principles into axiomatic systems. Much of math is deterministic (that is, it doesn’t consider or need uncertainty) because of how its elements are constructed. 2 + 3 = 5 not because we’ve seen that string of characters many times before, but because that is a statement that can be formally proven and will never vary even if you provide a million claims to the contrary. Computers work in this way because they use binary information and algorithms to perform functions (e.g., multiplication or division by shifting bits), and each algorithm can be formally shown to be correct. Mathematicians and computer scientists call these demonstrations “proofs.” That’s how we know computers and calculators will do what we design them to do, every time. Computational mathematical systems, also called symbolic systems, use this type of logic to solve math problems. You can use a symbolic system right now at Wolfram|Alpha. The outputs you receive will not be predictions. They will be solutions.
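
As a small illustration of the bit-shifting idea mentioned above, the following Python sketch multiplies two non-negative integers using only shifts and additions. The function name shift_add_multiply is a hypothetical chosen for this example, and real hardware multipliers are more elaborate, but the principle is the same: the result follows from the rules of binary arithmetic, not from anything seen before.

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")
    result = 0
    while b:
        if b & 1:      # lowest bit of b is set: add the current shifted a
            result += a
        a <<= 1        # shifting left by one bit doubles a
        b >>= 1        # shifting right moves on to the next bit of b
    return result

print(shift_add_multiply(6, 7))  # 42
# The answer always agrees with the built-in operator, however large the inputs.
print(shift_add_multiply(123456789, 987654321) == 123456789 * 987654321)  # True
```

Because each step is a fixed rule, the same inputs give the same output every time, which is the property the paragraph above describes.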

Why are these different? Mathematics and statistics are like languages in many ways; you can use their symbols and rules to communicate, build, and deduce. They are expressive. They expand your capabilities. However, just like human language is vast but an individual book is narrow, so too is statistics vast but an individual statistical model narrow. Models are necessarily reductive; each is a single lens through which to see the universe. They focus on the most important relationships they see, even though the large models can capture many useful patterns, and they always rely on training data and algorithm choices.

Because mathematics can make essentially infinitely and arbitrarily complex statements – since it is a language and system with which we can construct anything – asking LLMs to do math means asking them to make statistical predictions for hard non-statistical problems that no human has ever thought about, much less written down at scale for LLM training. Neither the data nor the algorithm can do this. Computer scientists are progressing toward improved reasoning skills in LLMs, and the strong relationship between reasoning and mathematics has been helpful for math-related outputs, but this has not yet produced the level of quality improvement required for mathematics in high-precision domains like education. We cannot count on language or multimodal models alone for provably correct deductive inferences to arbitrary hard problems (read more on this topic here).

LLMs Belong in Education

Using LLMs in math education remains a worthwhile area of research: LLM capabilities are useful around math because people use human language with math all the time. LLMs offer specific language-based advantages that have the potential to significantly improve math education. These include individualization of curriculum, feedback, and assessment items, such as creating engaging materials in students’ primary languages and situating those materials within students’ individual educational pathways. Chatbots can be quite friendly and offer supportive messages to math students, and because they are machines, chatbots offer just-in-time support even when human help is not available.

But to implement an A.I. system in a math education setting, its math outputs must be verifiably correct, just as math textbooks go through quality control processes. If LLMs actually do produce high-quality instructional materials, then we can leverage these capabilities to generate step-by-step explanations, Socratic tutoring, unlimited practice problems, tailored feedback, and error pattern identification, among many other potential use cases. But because LLMs do not create high-quality instructional materials or correct mathematics out of the box, the education and research community can work with technologists to ascertain whether these uses are possible.

Combining LLMs and Symbolic Systems for Enhanced Math Education

The key to enhancing A.I. for math education lies in combining the strengths of LLMs and symbolic systems. While LLMs can offer personalized learning experiences, feedback, and engagement through human language interaction, symbolic systems provide the foundation for accurate mathematical computations and logical reasoning. Integrating these components can lead to the development of neurosymbolic systems, which blend language understanding with mathematical precision.

The good news is that A.I. systems are already modular; large foundational models are only a portion of what makes A.I. work. Systems like ChatGPT also include adjustments that make model outputs appropriate for dialogue (which is how chatbots are made), trust and safety protections, knowledge integration from other sources, personalization, memory and other context information, and performance optimizations. Advanced foundational models don’t have to do all the work! They’re a necessary but not sufficient part of modern A.I. It turns out that we can add calculators (which are correct and cheap) to complement LLMs (which are often wrong and currently more expensive). Symbolic systems and neural systems (another name for the big statistical language models) come together in neurosymbolic systems. A common path is to train LLMs to output computer code, such as Python or the Wolfram Language, execute that code in a protected environment, and return the results to the LLM to display to the user. There is still a math correctness risk, since the conversion of human language or queries into symbolic code remains probabilistic, but it’s far easier to see how the symbolic system solved the math problem and to debug potential issues. One of the very first plugins for ChatGPT came from Wolfram Research, bringing the symbolic capabilities from Wolfram|Alpha to OpenAI’s chatbot.
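
The pipeline described above can be sketched in a few lines of Python. This is an illustration under loud assumptions, not any vendor's implementation: fake_llm_translate is a hypothetical stand-in that returns a canned expression where a real system would call a language model, and safe_eval is a minimal arithmetic-only evaluator standing in for a properly sandboxed execution environment.

```python
import ast
import operator

# Whitelisted arithmetic operators; every other construct is rejected.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def safe_eval(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](visit(node.operand))
        raise ValueError("unsupported expression")
    return visit(ast.parse(expression, mode="eval"))

def fake_llm_translate(question: str) -> str:
    """Hypothetical stand-in for an LLM that turns a question into code."""
    canned = {"What is 12 times 34, plus 5?": "12 * 34 + 5"}
    return canned[question]

def answer(question: str) -> str:
    code = fake_llm_translate(question)  # probabilistic step in a real system
    result = safe_eval(code)             # deterministic, symbolic step
    return f"{code} = {result}"          # keep the code so it can be inspected

print(answer("What is 12 times 34, plus 5?"))  # 12 * 34 + 5 = 413
```

Returning the generated expression alongside the result reflects the debugging point made above: if the translation step misreads the question, the mistake is visible in the code, while the arithmetic itself is not in doubt.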

We should continue evaluating foundational models based on their mathematics capabilities, but we should not expect their performance to achieve perfection on arbitrary math queries. Instead, we should design chatbots and other A.I.-powered systems with neurosymbolic interfaces. We can then evaluate A.I. progress by the extent to which the LLM takes advantage of its tools. And that’s the most human question of all.