An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
Yuren Hao, Xiang Wan, ChengXiang Zhai
TL;DR
The paper introduces the GAP framework to rigorously test LLMs' mathematical reasoning by applying equivalence-preserving transformations to competition-level problems. It debuts PutnamGAP, a 6,306-item benchmark that disentangles surface-level and structural generalization while mitigating data leakage. Across 18 models, results show consistent performance drops under both surface and kernel variants, with kernel perturbations producing the largest declines, which underscores how brittle the models' reasoning remains on mathematically equivalent problems. The work also provides a detailed robustness metric, an open-source evaluation stack, and practical directions for curriculum-based training and safety-aware evaluation. Overall, GAP reframes progress in mathematical AI as a matter of robustness to symbol and parameter changes as much as raw accuracy on unperturbed items.
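To make the idea of a surface variant concrete, here is a minimal sketch of a symbol-renaming transformation. It is not the paper's pipeline: the function name `rename_symbols`, the sample problem, and the mapping are all hypothetical, and the sketch assumes variables appear as whole-word tokens in plain text.

```python
import re

def rename_symbols(problem: str, mapping: dict[str, str]) -> str:
    """Rename whole-word symbols in a single pass (hypothetical helper).

    A single pass keeps swaps such as x <-> y consistent, because text that
    has already been replaced is never scanned again.
    """
    if not mapping:
        return problem
    # Word boundaries stop "n" from matching inside words like "integer".
    alternatives = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], problem)

# Illustrative problem statement, not taken from the benchmark.
original = "Let n be a positive integer and let x be real. Show that x^2 + n > 0."
variant = rename_symbols(original, {"n": "k", "x": "t"})
print(variant)
```

The renamed statement poses the same mathematical problem, so a model that reasons robustly should solve both versions equally well. A toy rule like this would mishandle symbols that collide with ordinary words (for example a variable named `a`), which is one reason a real benchmark needs more careful, verified transformations.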
Abstract
In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but differ in linguistic and parametric form. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models, we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the originals but drops by 4.7 percentage points on surface-renaming variants and by 12.9 percentage points on parametric variants, while smaller models fare far worse. Overall, the results show that the proposed evaluation methodology is effective for deepening our understanding of the robustness of LLMs and for generating new insights to further improve their mathematical reasoning capabilities.
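The reported drops are in percentage points, so the implied variant accuracies follow by subtraction from the 51.5% baseline. The short sketch below works through that arithmetic. The helper names are hypothetical, and the relative drop is a derived quantity that the abstract does not report.

```python
def accuracy_after_drop(baseline_pct: float, drop_pp: float) -> float:
    """Accuracy (in %) after an absolute drop given in percentage points."""
    return baseline_pct - drop_pp

def relative_drop_pct(baseline_pct: float, drop_pp: float) -> float:
    """Drop expressed as a percentage of the baseline accuracy (derived)."""
    return 100.0 * drop_pp / baseline_pct

baseline = 51.5  # O3 accuracy on the original problems, from the abstract
drops = {"surface-renaming": 4.7, "parametric": 12.9}  # percentage points

for variant_name, drop in drops.items():
    acc = accuracy_after_drop(baseline, drop)
    rel = relative_drop_pct(baseline, drop)
    print(f"{variant_name}: {acc:.1f}% accuracy, {rel:.1f}% relative drop")
```

Under these figures, the implied accuracies are about 46.8% on surface-renaming variants and 38.6% on parametric variants, which correspond to relative drops of roughly 9% and 25% of the baseline.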
