Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Joykirat Singh; Akshay Nambi; Vibhav Vineet

TL;DR

This work introduces a novel dataset, MWP-MISTAKE, which pairs MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Benchmarking on it highlights GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models.

Abstract

Large Language Models (LLMs) have been applied to Math Word Problems (MWPs) with transformative impact, changing how these complex problems are approached and solved in various domains, including educational settings. However, the evaluation of these models often prioritizes final accuracy, overlooking the crucial aspect of reasoning capabilities. This work addresses this gap by focusing on the ability of LLMs to detect and correct reasoning mistakes. We introduce a novel dataset, MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking reveals significant insights into the strengths and weaknesses of state-of-the-art models, such as GPT-4o, GPT-4, GPT-3.5Turbo, and others. We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models. Additionally, we identify issues related to data contamination and memorization, which impact the reliability of LLMs in real-world applications. Our findings emphasize the importance of rigorous evaluation of reasoning processes and propose future directions to enhance the generalization and robustness of LLMs in mathematical problem-solving.

Paper Structure (23 sections, 4 figures, 12 tables)

Introduction
MWP-Mistake Dataset
Meticulously Crafted Rules to Programmatically Inject Errors
Smaller Models as Bad Reasoners
Experimental Setup
Results and Analysis
Question 1: Can LLMs Effectively Identify Mistakes in Reasoning Steps?
Can LLMs Accurately Derive Correct Answers Despite Mistakes?
Exploring Data Contamination and Memorization Effects in Math Reasoning Tasks
Can LLMs Correctly Rectify Mistakes in Reasoning Steps?
Key Insights, Takeaways, and Potential Directions for Improving Mathematical Reasoning
Related Work
Conclusions
MWP-MISTAKE Dataset
Prompts to curate reasoning steps in MWP-MISTAKE dataset
...and 8 more sections

Figures (4)

Figure 1: The model is prompted with a question and incorrect reasoning steps, and is asked to detect any mistake and correct the reasoning to reach the correct final answer. GPT-4o generates the correct output, while GPT-3.5Turbo fails to identify any mistake in the reasoning steps. (Task T1)
Figure 2: Examples of MWPs from MATH with correct reasoning, rule-based incorrect reasoning, and smaller-model-based incorrect reasoning.
Figure 3: Performance in deriving the final answer for T1 versus T2. Performance drops significantly when the model does not rectify the incorrect reasoning steps.
Figure 4: Difference in ROUGE-L score between guided and general instructions across all models and datasets. A high positive difference indicates high contamination, while a low positive or negative difference indicates little to no contamination.
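
The contamination signal described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes a whitespace-tokenized, LCS-based ROUGE-L F1 score and two model completions per example (one produced under a guided instruction, one under a general instruction). The function names, the tokenization, and the choice of the F1 variant are assumptions made for illustration; the paper's exact prompts and scoring setup are not given in this section.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 on lowercased whitespace tokens (an assumed variant)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def contamination_gap(reference, guided_completion, general_completion):
    """Guided minus general ROUGE-L (hypothetical helper).

    A large positive value means the guided prompt reproduced the
    reference much more closely than the general prompt did.
    """
    return (rouge_l_f1(reference, guided_completion)
            - rouge_l_f1(reference, general_completion))


if __name__ == "__main__":
    reference = "first add 3 and 4 then multiply by 2"
    guided = "first add 3 and 4 then multiply by 2"
    general = "compute the sum and double it"
    print(contamination_gap(reference, guided, general))
```

Averaging this gap over a dataset, per model, gives one number per (model, dataset) pair of the kind the figure plots, under the assumptions stated above.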
