Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning
Tiasa Singha Roy, Aditeya Baral, Ayush Rajesh Jhaveri, Yusuf Baig
TL;DR
This work tackles the gap between final-answer accuracy and the quality of the underlying mathematical reasoning in large language models. It introduces a three-stage evaluation framework that combines self-reflection, a Judge LLM, and a novel MAPLE score, which quantifies reasoning misalignment by aggregating error frequencies, redundancy, and validity. Applying this framework to the MATH dataset across multiple models reveals distinct error patterns and shows that MAPLE can expose reasoning weaknesses that accuracy alone does not capture, with trends that vary by model and topic. The results offer a systematic, holistic approach to diagnosing and guiding improvements in LLM-based mathematical problem solving, with implications for evaluation protocols and model development.
Abstract
Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely on accuracy, which accounts only for the final answer. This study explores the pitfalls in LLM mathematical reasoning by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
