Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges

Safal Shrestha, Minwu Kim, Keith Ross

TL;DR

The paper addresses the limitation of evaluating mathematical reasoning with narrow numerical ranges and proposes GSM-Ranges to test LLM robustness across six perturbation levels. It introduces an automated grading methodology that distinguishes logical from non-logical errors by translating responses into Python code and executing them to verify reasoning. Across nine models, logical errors rise by up to 14 percentage points as perturbation levels increase, and arithmetic accuracy deteriorates when computations are embedded in word problems, highlighting limited numerical generalization. The GSM-Ranges framework provides a precise, scalable approach to assessing mathematical reasoning in LLMs and guides future improvements in numerical generalization.
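The grading idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `run_translation` and `grade`, the convention that the translated code stores its result in a variable called `answer`, and the tolerance values are all hypothetical. The sketch assumes a response counts as a non-logical error when its final answer is wrong but the Python translation of its reasoning, once executed, reproduces the ground truth.

```python
import math


def run_translation(code: str) -> float:
    """Execute a Python translation of a model's reasoning.

    Assumption: the translated code assigns its result to `answer`.
    Note: exec on untrusted code is unsafe; a real grader would sandbox it.
    """
    namespace = {}
    exec(code, namespace)
    return float(namespace["answer"])


def grade(model_answer: float, code_answer: float, ground_truth: float) -> str:
    """Assign one of three labels: correct, non-logical error, logical error."""

    def same(a: float, b: float) -> bool:
        # Hypothetical tolerances for comparing numeric answers.
        return math.isclose(a, b, rel_tol=1e-6, abs_tol=1e-9)

    if same(model_answer, ground_truth):
        return "correct"
    if same(code_answer, ground_truth):
        # The reasoning is sound once the arithmetic is done by the interpreter.
        return "non-logical error"
    return "logical error"


# The model wrote the right steps but slipped on the multiplication.
translated = "boxes = 3\nper_box = 40\nanswer = boxes * per_box"
label = grade(model_answer=119, code_answer=run_translation(translated), ground_truth=120)
```

Here `label` is `"non-logical error"`: the stated answer (119) is wrong, yet the executed reasoning gives 120.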

Abstract

Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, failing to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates (up to 14 percentage points) as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs' mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.
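To make the perturbation idea concrete, here is a rough sketch of replacing the numbers in a word problem with values drawn from a level-specific range. Everything specific in it is an assumption for illustration: the `LEVEL_RANGES` table, the `perturb` function, and the regex-based substitution are hypothetical and are not taken from GSM-Ranges itself. The sketch also only rewrites the question text; a real generator would need to recompute the ground-truth answer for the new values.

```python
import random
import re

# Hypothetical ranges for six levels: level k draws integers with k+1 digits.
# These are placeholders, not the ranges used by GSM-Ranges.
LEVEL_RANGES = {k: (10**k, 10 ** (k + 1) - 1) for k in range(1, 7)}


def perturb(question: str, level: int, seed: int = 0) -> str:
    """Replace every run of digits in `question` with a random integer
    drawn from the range assigned to `level`.

    Simplification: numbers written with separators or decimals
    (e.g. "1,000" or "2.5") are not handled as single values.
    """
    rng = random.Random(seed)  # seeded for reproducible perturbations
    low, high = LEVEL_RANGES[level]
    return re.sub(r"\d+", lambda _m: str(rng.randint(low, high)), question)


question = "Tom has 3 boxes with 12 apples each. How many apples does he have?"
harder = perturb(question, level=4)
```

The wording of `harder` is identical to `question`; only the two numbers change, each to a five-digit value under the assumed level-4 range.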

Paper Structure

This paper contains 33 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Cumulative frequency distribution of numerical values in questions and ground truth answers. Numbers <1,000 account for 94.9% (GSM8K), 97.8% (SVAMP), and 98.0% (MATH) of the values.
  • Figure 2: Illustration of grading process for LLM responses using the GPT-4o model, categorizing outputs into three labels: correct, non-logical error, and logical error.
  • Figure 3: Logical & non-logical error rates across different perturbation levels. The left panel illustrates the increase in logical errors across the datasets, while the right panel depicts the rise in non-logical errors. Error rates are reported relative to the baseline logical and non-logical error rates on the original GSM8K problems.
  • Figure 4: Logical error gaps across perturbation levels. For each model, the top bar represents the percentage point difference in logical errors between Level 6 and Level 1 perturbations, while the bottom bar indicates the percentage point difference between Level 1 and the original GSM8K questions.
  • Figure 5: Recall rates across perturbation levels and original GSM8K questions for different sampling sizes (1, 8, 32, 48).
  • ...and 1 more figure