An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

Yuren Hao, Xiang Wan, ChengXiang Zhai

TL;DR

The paper introduces the GAP framework to rigorously test LLMs’ mathematical reasoning by applying equivalence-preserving transformations to competition-level problems. It debuts PutnamGAP, a 6,306-item benchmark that disentangles surface-level and structural generalization while mitigating data leakage. Across 18 models, results show consistent performance drops under both surface and kernel variants, with kernel perturbations producing the largest declines, underscoring brittleness in high-signal reasoning. The work also provides a detailed robustness metric, an open-source evaluation stack, and practical directions for curriculum-based training and safety-aware evaluation. Overall, GAP reframes progress in mathematical AI as robustness to symbol and parameter changes as much as raw accuracy on unperturbed items.
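To make the idea of a surface-level, equivalence-preserving transformation concrete, here is a minimal sketch of symbol renaming. It is not the authors' pipeline: the function name `rename_symbols`, the regex-based approach, and the toy problem statement are all hypothetical and chosen only for illustration.

```python
import re


def rename_symbols(problem: str, mapping: dict[str, str]) -> str:
    """Rename whole-token symbols in one pass (hypothetical helper).

    A single regex pass applies all renamings simultaneously, so a swap
    such as {"x": "y", "y": "x"} does not collapse both symbols into one.
    """
    if not mapping:
        return problem
    # Try longer names first so that "x1" is matched before "x".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])("
        + "|".join(re.escape(name) for name in names)
        + r")(?![A-Za-z0-9_])"
    )
    return pattern.sub(lambda match: mapping[match.group(1)], problem)


# Toy problem statement (made up for this sketch, not taken from the benchmark).
original = "Find all real x such that x^2 + n*x = 0, where n is a positive integer."
variant = rename_symbols(original, {"x": "t", "n": "k"})
print(variant)
```

The renamed statement has the same mathematical content as the original, so any accuracy difference between the two reflects sensitivity to notation rather than to the mathematics.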

Abstract

In this paper, we introduce a systematic framework that goes beyond conventional methods for assessing the mathematical-reasoning robustness of LLMs: we stress-test them on advanced math problems that are mathematically equivalent but differ in linguistic and parametric form. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this evaluation methodology, we created PutnamGAP, a new benchmark dataset containing multiple mathematically equivalent variations of competition-level math problems. With this dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models, we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the originals but drops by 4.7 percentage points on surface-renaming variants and by 12.9 percentage points on parametric variants, while smaller models fare far worse. Overall, the results show that the proposed evaluation methodology is effective for deepening our understanding of the robustness of LLMs and for generating new insights to further improve their mathematical reasoning capabilities.
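The figures quoted for O3 can be turned into implied variant accuracies and relative drops with a few lines of arithmetic. The sketch below assumes that both drops are measured in percentage points from the 51.5% baseline on a comparable item set; the function and variable names are illustrative and not from the paper.

```python
def accuracy_after_drop(original_pct: float, drop_pp: float) -> float:
    """Accuracy (in %) after subtracting a drop given in percentage points."""
    return round(original_pct - drop_pp, 1)


def relative_drop(original_pct: float, drop_pp: float) -> float:
    """Drop expressed as a percentage of the original accuracy."""
    return round(100.0 * drop_pp / original_pct, 1)


# Numbers quoted in the abstract for O3.
ORIGINAL_PCT = 51.5
DROPS_PP = {"surface-renaming": 4.7, "parametric": 12.9}

# Assumption: each drop is relative to the same 51.5% baseline.
summary = {
    family: (
        accuracy_after_drop(ORIGINAL_PCT, drop),
        relative_drop(ORIGINAL_PCT, drop),
    )
    for family, drop in DROPS_PP.items()
}

for family, (accuracy, rel) in summary.items():
    print(f"{family}: {accuracy}% accuracy, {rel}% relative drop")
```

Under that assumption, the implied accuracies are 46.8% on surface-renaming variants and 38.6% on parametric variants, so the parametric drop removes roughly a quarter of the original accuracy.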

Paper Structure

This paper contains 89 sections, 154 equations, 9 figures, 4 tables, and 2 algorithms.

Figures (9)

  • Figure 1: PutnamGAP variants performance relative to the original set
  • Figure 2: Surface renaming variant family pipeline
  • Figure 3: Parametric variant family pipeline
  • Figure 4: Bar plot of per-model accuracy on each variant, with 95% CIs
  • Figure 5: Error composition ratio across variants
  • ...and 4 more figures