The Failure of LLMs in Math and How to Solve For It

Mathematics has always posed a significant challenge for AI models. Mastering math requires complex reasoning skills, and for AI, this task is anything but straightforward. That is a serious problem given the importance of mathematical proficiency for professional, personal, and academic success.

Despite their remarkable abilities, large language models (LLMs) often struggle with complex mathematical tasks, such as geometry, that demand advanced reasoning skills. This brings us to the critical question: how much of an AI model’s mathematical ability stems from genuine reasoning, and how much from mere recall of training data?

Recent findings from Apple show that even on grade school math word problems, the most sophisticated models are not completely driven by “reasoning.”

Taking this one step further, the R&D team at MathGPT.ai shed new light on the areas of math, from algebra to calculus, that require the most improvement.

This research explored how variations in problem context and language affect performance across different LLMs, including OpenAI’s latest o1-preview and o1-mini models. The findings revealed a concerning trend: accuracy consistently declined as problems deviated from the original questions available in the LLMs’ training data, with performance falling steeply on more challenging mathematical benchmarks above the grade school level.

The Recall vs. Reasoning Dilemma

The investigation focused on three key factors:

  1. Using more challenging mathematical benchmarks than grade school math
  2. Exploring a “1-shot prompt” with extreme closeness to the test problem
  3. Implementing a “best of n” strategy for n attempts at the same problem, effectively majority voting at inference time to eliminate statistical anomalies
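The third factor can be sketched in a few lines of Python. This is a minimal illustration, not the MathGPT.ai implementation: the function names `majority_vote` and `solve_with_voting`, the `sample_answer` callable that stands in for one model call, and the simple answer normalization are all assumptions made for this example.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(answers: list[str]) -> Optional[str]:
    """Return the most frequent answer after light normalization.

    Returns None when no non-empty answer was produced.
    """
    # Hypothetical normalization: trim whitespace and ignore case.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    # most_common(1) yields the (answer, count) pair with the highest count.
    return Counter(normalized).most_common(1)[0][0]


def solve_with_voting(
    problem: str,
    sample_answer: Callable[[str], str],  # assumed stand-in for one model call
    n: int = 5,
) -> Optional[str]:
    """Ask for n independent attempts at the same problem and vote."""
    attempts = [sample_answer(problem) for _ in range(n)]
    return majority_vote(attempts)
```

With attempts such as `["42", " 42 ", "41"]`, the vote returns `"42"`: a single stray answer is outvoted, which is the sense in which voting filters out statistical anomalies.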

The results were both intriguing and concerning. Pushing the boundaries of problem variation showed a consistent decline in AI model performance as the mathematical equations became more complex.

The MATH Dataset Challenge

The team used the MATH dataset, known for its challenging high-school-level problems, rather than the Grade School Math 8K (GSM8K) dataset, which contains 8,500 linguistically diverse elementary-level problems. This choice allowed MathGPT.ai to examine model performance across varying difficulty levels and topics, from pre-algebra to number theory.

In testing, we varied the language, variables, and context of the problems while keeping the numerical values and final answers unchanged. For instance, a “dog walking” scenario might be transformed into a “dishwasher” problem. This method helped mitigate the increased complexity of the MATH dataset while still challenging the models’ reasoning abilities.
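One way to picture this constraint is a small check that a reworded problem carries exactly the same numbers as the original. The sketch below is an assumption-laden illustration: the helper names, the regular expression, and both example problems are invented here and are not taken from the MATH dataset or from the MathGPT.ai test harness.

```python
import re
from collections import Counter

# Matches integers and simple decimals such as 3 or 4.5 (illustrative only).
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text: str) -> Counter:
    """Count every numeric literal that appears in the problem text."""
    return Counter(NUMBER.findall(text))


def preserves_numbers(original: str, variation: str) -> bool:
    """True when both texts contain the same numbers the same number of times."""
    return numbers_in(original) == numbers_in(variation)


# Hypothetical example problems, written for this sketch.
original = (
    "Sam walks 3 dogs for 45 minutes each. "
    "How many minutes does Sam spend walking in total?"
)
variation = (
    "A dishwasher runs 3 cycles of 45 minutes each. "
    "How many minutes does it run in total?"
)

print(preserves_numbers(original, variation))
```

A check like this only confirms that the surface numbers match. It does not verify that the reworded problem still has the same final answer, which would need a separate review.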

Revealing Results

The results were striking. Even the most advanced models struggled when faced with variations of problems they had likely encountered in their training data. For example, the accuracy of OpenAI’s o1-mini model fell from 93.66% on original questions to 88.54% on the most challenging variation. The o1-preview model experienced a similar decline, dropping from 91.22% to 82.93%, a sharp enough drop to highlight critical gaps in the models’ robustness.
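To put these figures side by side, the short calculation below works out the drop for each model, both in percentage points and relative to the original accuracy. The accuracy values come from the paragraph above; the function name `accuracy_drop` and the two-decimal rounding are choices made for this sketch.

```python
def accuracy_drop(original: float, varied: float) -> tuple[float, float]:
    """Return (drop in percentage points, drop relative to original in %)."""
    absolute = original - varied
    relative = absolute / original * 100
    return round(absolute, 2), round(relative, 2)


# Accuracy on original questions vs. the most challenging variation.
results = {
    "o1-mini": (93.66, 88.54),
    "o1-preview": (91.22, 82.93),
}

for model, (original, varied) in results.items():
    points, relative = accuracy_drop(original, varied)
    print(f"{model}: -{points} points ({relative}% relative drop)")
```

This gives a drop of 5.12 points (about 5.47% relative) for o1-mini and 8.29 points (about 9.09% relative) for o1-preview.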

These findings align with and build on Apple’s earlier research, demonstrating that the limitations in AI’s mathematical reasoning become more apparent as problems grow more complex and require deeper understanding rather than pattern recognition.

The Path Forward

As we continue to push the boundaries of LLM reasoning, it’s crucial to recognize both its incredible potential and its current limitations. New research underscores the need for continued innovation in developing AI models capable of moving beyond pattern recognition to achieve more robust and generalizable problem-solving skills.

This comes at a critical time, especially in higher education, where AI is being used more heavily as an instructor’s aid in the classroom while schools continue to see high failure rates among math students who are unprepared for their courses.

Achieving human-like cognitive capabilities or general intelligence in AI demands not only technological advancements but also a nuanced understanding of how to bridge the gap between recall and true reasoning. 

If we’re successful on this path, I’m confident we can change the lives of millions of students and even professionals, putting them on an entirely new trajectory.

