Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models

Zhishen Sun, Guang Dai, Ivor Tsang, Haishan Ye

TL;DR

This paper introduces a progressive sentence-level perturbation framework to probe mathematical reasoning in LLMs, distinguishing perturbations containing numerical information from those without and adding a core-questioning-instruction-missing variant. It evaluates a broad set of models on GSM8K and AIME25, showing that numerical perturbations significantly degrade performance while non-numeric perturbations are less disruptive; results also suggest models rely on memorization or pattern templates rather than genuine reasoning. The study highlights critical robustness gaps in current LLMs and provides a diagnostic framework to guide the development of more reliable, math-capable systems. The framework and findings have practical implications for evaluating safety, data leakage, and the true reasoning capabilities of LLMs in complex mathematical tasks.

Abstract

LLMs have made significant progress in mathematical reasoning, but whether they possess genuine mathematical understanding remains controversial. To explore this question, we propose a new perturbation framework that evaluates LLMs' reasoning ability in complex environments by injecting additional semantically irrelevant perturbation sentences and gradually increasing the perturbation intensity. We also use an additional perturbation method, core questioning instruction missing, to further analyze the LLMs' problem-solving mechanism. The experimental results show that LLMs perform stably when facing perturbation sentences without numbers, although this robustness has a boundary: as the perturbation intensity increases, performance declines to varying degrees. When facing perturbation sentences with numbers, performance decreases more significantly: most open-source models with smaller parameter counts drop by nearly or even more than 10%, and the drop grows as the perturbation intensity increases, reaching a maximum of 51.55%. Even the most advanced commercial LLMs see a 3%-10% performance drop. By analyzing the reasoning process of LLMs in detail, we find that models are more sensitive to perturbations with numerical information and are more likely to give incorrect answers when disturbed by irrelevant numerical information. The higher the perturbation intensity, the more obvious these defects become. Moreover, in the absence of the core questioning instruction, models still maintain an accuracy of 20%-40%, indicating that LLMs may rely on memorized templates or pattern matching to complete the task rather than on logical reasoning. Overall, our work reveals the shortcomings and limitations of current LLMs' reasoning capabilities, which is of great significance for the further development of LLMs.
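
The perturbation types described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the distractor sentences, the function names (`split_sentences`, `inject_perturbations`, `drop_question`), the insertion policy (random positions before the final sentence), and the rule that the "core questioning instruction" is any sentence ending in a question mark are all assumptions made for illustration.

```python
import random
import re

# Hypothetical distractor pools; the paper's actual perturbation
# sentences are not reproduced here.
DISTRACTORS_NO_NUMBERS = [
    "The weather was pleasant that afternoon.",
    "Her neighbor enjoys painting landscapes.",
    "The library is located near the park.",
]
DISTRACTORS_WITH_NUMBERS = [
    "The nearby bakery opened 12 years ago.",
    "A bus passes the street every 25 minutes.",
    "The local museum displays 340 paintings.",
]


def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def inject_perturbations(problem, intensity, with_numbers, seed=0):
    """Insert `intensity` irrelevant sentences (capped at the pool size).

    Assumption: distractors go at random positions before the final
    sentence, so the question itself stays last.
    """
    rng = random.Random(seed)
    pool = DISTRACTORS_WITH_NUMBERS if with_numbers else DISTRACTORS_NO_NUMBERS
    sentences = split_sentences(problem)
    for distractor in rng.sample(pool, k=min(intensity, len(pool))):
        position = rng.randint(0, max(len(sentences) - 1, 0))
        sentences.insert(position, distractor)
    return " ".join(sentences)


def drop_question(problem):
    """Remove every sentence ending in '?' (assumed to be the core question)."""
    return " ".join(s for s in split_sentences(problem) if not s.endswith("?"))


problem = "Tom has 3 apples. He buys 5 more. How many apples does Tom have?"
for level in (1, 2, 3):  # progressively higher perturbation intensity
    print(level, inject_perturbations(problem, level, with_numbers=True))
print(drop_question(problem))
```

Raising `intensity` while holding the original problem fixed mirrors the progressive design: any change in a model's answer can then be attributed to the inserted sentences rather than to a change in the underlying task.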

Paper Structure

This paper contains 17 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Qwen2.5-Math-1.5B-Instruct behaves differently when solving problems involving perturbations with numbers and when solving problems involving perturbations without numbers. More examples are presented in Appendix A.
  • Figure 2: Examples of the original problem and different perturbation versions.
  • Figure 3: Performance of the LLMs on different perturbation types when only one perturbation sentence is inserted.
  • Figure 4: Performance changes of LLMs under different perturbation intensities. Here, we show the performance changes of four models on the GSM8K benchmark; the performance changes of the other models are shown in Appendix B.
  • Figure 5: The performance of OpenAI-o3 in solving problems containing perturbation sentences with numbers. From this conversation, it can be seen that OpenAI-o3 does not filter out the perturbation sentences; instead, it actively attempts to utilize the information within them to arrive at a solution.
  • ...and 2 more figures