Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Xiaoyuan Li; Wenjie Wang; Moxin Li; Junrong Guo; Yang Zhang; Fuli Feng

TL;DR

This work shifts the evaluation of mathematical reasoning in LLMs from pure problem-solving to a fine-grained examiner perspective, introducing four tasks (EP, ES, ET, EC) and a nine-type error taxonomy evaluated on a GPT-4–generated dataset (EIC-Math) built from GSM8K and MathQA. It systematically analyzes 11 LLMs, finding that GPT-4 performs best overall, that calculation errors remain the hardest error type, and that prompting with the error type strongly improves correction accuracy (an average gain of 47.9%). The study provides insights into prompt design, error diagnosis, and correction strategies, and offers a dataset for future research in mathematical reasoning and error repair in LLMs.

Abstract

The rapid advancement of Large Language Models (LLMs) in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving from the examinee perspective, overlooking the dual perspective of the examiner regarding error identification and correction. From the examiner perspective, we define four evaluation tasks for error identification and correction, along with a new dataset with annotated error types and steps. We also design diverse prompts to thoroughly evaluate eleven representative LLMs. Our principal findings indicate that GPT-4 outperforms all models, while the open-source model LLaMA-2-7B demonstrates comparable abilities to the closed-source models GPT-3.5 and Gemini Pro. Notably, calculation errors prove to be the most challenging error type. Moreover, prompting LLMs with the error types can improve the average correction accuracy by 47.9%. These results reveal potential directions for developing the mathematical reasoning abilities of LLMs. Our code and dataset are available at https://github.com/LittleCirc1e/EIC.

Paper Structure (33 sections, 34 figures, 42 tables)

Introduction
Task Formulation
Dataset Construction
Experiment
Model Performance (RQ1)
Error Type Analysis (RQ2)
Prompt Robustness (RQ3)
In-depth Analysis
Related Work
Conclusion
Dataset Selection
Detailed Error Type Definition
Generation Rules Design and Examples
Human Evaluation
Additional In-depth Experiments
...and 18 more sections

Figures (34)

Figure 1: Traditional evaluation on problem-solving and our evaluation on error identification and correction.
Figure 2: Illustration of dataset construction and the four evaluation tasks. For dataset construction, we use GPT-4 to convert ground-truth solutions into wrong solutions containing specific error types. The four evaluation tasks comprehensively assess LLMs' error identification and correction abilities from diverse perspectives.
Figure 3: Accuracy of traditional task and our task on GSM8K.
Figure 4: Accuracy of incomplete cases and complete cases on GSM8K for closed-source models.
Figure 5: Dataset format.
...and 29 more figures
