Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning

Tiasa Singha Roy, Aditeya Baral, Ayush Rajesh Jhaveri, Yusuf Baig

TL;DR

This work tackles the gap between final-answer accuracy and the underlying mathematical reasoning quality of large language models. It introduces a three-stage evaluation framework that uses self-reflection, a Judge LLM, and a novel MAPLE score to quantify reasoning misalignment by aggregating error frequencies, redundancy, and validity. Applying this framework to the MATH dataset across multiple models reveals distinct error patterns and shows that MAPLE can expose reasoning weaknesses that accuracy alone does not capture, with trends that depend on both model and topic. The results offer a systematic, holistic approach to diagnosing and guiding improvements in LLM-based mathematical problem solving, with implications for evaluation protocols and model development.

Abstract

Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely based on accuracy, which only accounts for the final answer. This study explores these pitfalls by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.

Paper Structure

This paper contains 19 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Architecture of our LLM Agent evaluation and identification of errors. The LLM's generated answer is evaluated in a multi-turn set-up to identify the failing points in the generated response using self-reflection and clustering.
  • Figure 2: Architecture of our Judge LLM Agent and MAPLE Score. The Judge LLM provides step-wise analysis to compute MAPLE score using label-frequencies and label-weights.
  • Figure 3: Comparison of LLM performance across difficulty levels on the MATH Dataset. Level 1 represents the easiest and Level 5 represents the toughest math problems. We observe a correlation between final answer accuracy and the degree of incorrectness represented by the MAPLE score.
  • Figure 4: Comparison of accuracy of the LLM as a Judge in predicting error labels for generated solutions. We observe that most predictions match human annotations for a representative sample of 105 evenly-distributed examples across difficulty levels and topics.
  • Figure 5: Comparison of LLM performance across math topics on the MATH Dataset. We observe that most models perform better at easier topics such as geometry while underperforming at tougher topics such as calculus.