Benchmarking Reasoning Robustness in Large Language Models

Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, Dacheng Tao

TL;DR

Robust mathematical reasoning in LLMs is challenged by positional bias, instruction sensitivity, numerical fragility, and memory dependence. The paper introduces Math-RoB, a benchmark suite with four datasets and a Memory Completion Rate (MCR) metric to systematically probe robustness, using instruction-based data generation and a long-CoT toolkit with Process-supervised Reward Models (PRMs) and Monte Carlo Tree Search (MCTS). Evaluations across 12 LLMs reveal that while larger models improve instruction adherence, robustness gains are limited and memorization-driven failures persist, especially on long inputs, operator changes, and missing data. The work highlights the need to shift focus from raw reasoning performance to genuinely robust reasoning frameworks, and provides a publicly available toolbox for broader evaluation.

Abstract

Despite the recent success of large language models (LLMs) such as DeepSeek in reasoning, we identify, for the first time, a key dilemma in reasoning robustness and generalization: significant performance degradation on novel or incomplete data, suggesting a reliance on memorized patterns rather than systematic reasoning. Our closer examination reveals four key limitations underlying this issue: (1) Positional bias: models favor earlier queries in multi-query inputs and may answer the wrong query when the target appears later (e.g., GPT-4o's accuracy drops from 75.8 percent to 72.8 percent); (2) Instruction sensitivity: with auxiliary guidance, performance declines by 5.0 to 7.5 percent in the Qwen2.5 Series and by 5.0 percent in DeepSeek-V3; (3) Numerical fragility: value substitution sharply reduces accuracy (e.g., GPT-4o drops from 97.5 percent to 82.5 percent, GPT-o1-mini drops from 97.5 percent to 92.5 percent); and (4) Memory dependence: models resort to guesswork when critical data is missing. These findings further highlight the reliance on heuristic recall over rigorous logical inference, demonstrating challenges in reasoning robustness. To comprehensively investigate these robustness challenges, this paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. This is achieved by an instruction-based approach to generate diverse datasets that closely resemble training distributions, facilitating a holistic robustness assessment and advancing the development of more robust reasoning frameworks.
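
The accuracy figures quoted in the abstract can be read either as a drop in percentage points or as a drop relative to the original accuracy. The sketch below computes both for the reported pairs. The function names and the relative-drop definition are illustrative assumptions, not the paper's stated formula for its drop rate.

```python
from typing import Dict, Tuple


def absolute_drop(before: float, after: float) -> float:
    """Drop in percentage points (assumed definition, for illustration)."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Drop as a percentage of the original accuracy (assumed definition)."""
    if before == 0:
        raise ValueError("baseline accuracy must be non-zero")
    return (before - after) / before * 100.0


# Accuracy pairs (percent, before and after perturbation) quoted in the abstract.
reported: Dict[str, Tuple[float, float]] = {
    "GPT-4o, positional bias": (75.8, 72.8),
    "GPT-4o, value substitution": (97.5, 82.5),
    "GPT-o1-mini, value substitution": (97.5, 92.5),
}

for name, (before, after) in reported.items():
    print(
        f"{name}: {absolute_drop(before, after):.1f} points, "
        f"{relative_drop(before, after):.1f}% relative"
    )
```

Under these assumed definitions, GPT-4o's fall from 97.5 to 82.5 percent is a 15.0-point drop, or roughly 15.4 percent of its original accuracy.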

Paper Structure

This paper contains 22 sections, 1 equation, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Illustration of the identified lack of robustness in LLM reasoning, using DeepSeek-V3 as an example. In the leftmost scenario, when the crucial digit "3" is missing, the LLM autonomously fills in the gap. In the middle scenario, where a long text presents three questions with a directive to answer only one, the later questions are often not correctly understood or reasoned about, and the model sometimes answers the wrong question instead. In the rightmost scenario, the model follows instructions but replaces only one operator at a time instead of modifying all operators simultaneously.
  • Figure 2: Evaluation results of models. (a) After incorporating a PRM, inference performance improves for most models, and larger models are more resilient to disturbances, with a lower drop rate. (b) Results for DeepSeek and the Qwen series, both of which demonstrate strong accuracy and robustness against interference.
  • Figure 3: The performance drop rate of the model in Math-RoB-Define. The dashed line represents the drop rate from Math500. From left to right, the order is MajorityVote, MinVote, LastVote, MinMax, and LastMax.
  • Figure 4: The performance drop rate of the model in Math-RoB-Number. The dashed line represents the drop rate from Math500.
  • Figure 5: Model instruction following and accuracy on Math-RoB-Define. Larger models show better instruction following and better reasoning performance.
  • ...and 1 more figures
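
The Figure 3 caption names five answer-selection strategies (MajorityVote, MinVote, LastVote, MinMax, LastMax) without defining them. The sketch below shows one common reading of such names in PRM-based reranking, offered as an assumption rather than the paper's implementation: "Min" and "Last" reduce a candidate's per-step PRM scores to a single value, "Vote" sums those values per distinct answer, and "Max" keeps the single best-scoring candidate. All names, the data layout, and the scores are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# A candidate is (final_answer, per_step_prm_scores). Hypothetical layout.
Candidate = Tuple[str, List[float]]
Reducer = Callable[[Sequence[float]], float]


def last(scores: Sequence[float]) -> float:
    """Use only the PRM score of the final reasoning step."""
    return scores[-1]


def majority_vote(cands: List[Candidate]) -> str:
    """Pick the most frequent final answer, ignoring PRM scores."""
    return Counter(answer for answer, _ in cands).most_common(1)[0][0]


def weighted_vote(cands: List[Candidate], reduce: Reducer) -> str:
    """Sum the reduced scores per answer; return the highest total."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, scores in cands:
        totals[answer] += reduce(scores)
    return max(totals, key=lambda a: totals[a])


def best_of(cands: List[Candidate], reduce: Reducer) -> str:
    """Return the answer of the single candidate with the best reduced score."""
    return max(cands, key=lambda c: reduce(c[1]))[0]


# Assumed mapping from the caption's names to the helpers above.
STRATEGIES: Dict[str, Callable[[List[Candidate]], str]] = {
    "MajorityVote": majority_vote,
    "MinVote": lambda c: weighted_vote(c, min),
    "LastVote": lambda c: weighted_vote(c, last),
    "MinMax": lambda c: best_of(c, min),
    "LastMax": lambda c: best_of(c, last),
}

# Made-up candidates: two weaker chains agree on "42", one strong chain says "17".
candidates: List[Candidate] = [
    ("42", [0.9, 0.5, 0.6]),
    ("42", [0.8, 0.4, 0.5]),
    ("17", [0.95, 0.85, 0.7]),
]

for name, select in STRATEGIES.items():
    print(name, "->", select(candidates))
```

With these made-up scores, the three voting strategies select "42" because two chains agree on it, while MinMax and LastMax select "17" because that single chain scores highest. This is the kind of disagreement between strategies that a per-strategy drop-rate comparison would surface.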