Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought?

Alexander von Recum, Leander Girrbach, Zeynep Akata

TL;DR

This work investigates how robust Reasoning LLMs (RLLMs) are to interventions within their own chain-of-thought (CoT) traces. It introduces a controlled framework that perturbs CoTs at fixed timesteps using seven interventions across nine open-weight models and three domains (Math, Science, Logic), evaluating recovery by sampling eight continuations per variant. The study finds that RLLMs are generally robust, with larger models showing stronger recovery, and that doubt expressions play a central role in self-correction, though paraphrasing can suppress doubt and degrade accuracy; interventions also impose a substantial CoT-length cost, especially for neutral perturbations. These results illuminate metacognitive aspects of reasoning in LLMs, reveal trade-offs between robustness and efficiency, and suggest training directions that preserve useful doubt signals and improve style robustness for safer deployment in high-stakes tasks.

Abstract

Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs) before giving an answer, which improves performance on complex tasks and makes reasoning more transparent. But how robust are these reasoning traces to disruptions that occur within them? To address this question, we introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps. We design seven interventions (benign, neutral, and adversarial) and apply them to multiple open-weight RLLMs across Math, Science, and Logic tasks. Our results show that RLLMs are generally robust, reliably recovering from diverse perturbations, with robustness improving with model size and degrading when interventions occur early. However, robustness is not style-invariant: paraphrasing suppresses doubt-like expressions and reduces performance, while other interventions trigger doubt and support recovery. Recovery also carries a cost: neutral and adversarial noise can inflate CoT length by more than 200%, whereas paraphrasing shortens traces but harms accuracy. These findings provide new evidence on how RLLMs maintain reasoning integrity, identify doubt as a central recovery mechanism, and highlight trade-offs between robustness and efficiency that future training methods should address.

Paper Structure

This paper contains 28 sections, 2 equations, 12 figures, and 17 tables.

Figures (12)

  • Figure 1: Overview of our evaluation method. We generate CoTs from 9 RLLMs using prompts from NuminaMath and curate a subset of 600 suitable prompts that all models answer correctly. Then, we segment the CoTs into reasoning steps and perform various interventions at fixed timesteps in the reasoning chains. We sample continuations from the intervened chains to probe the RLLMs' robustness and analyze whether models still reach the correct answer.
  • Figure 2: Majority robustness scores for all 9 models and 7 interventions across different timesteps. A score of 1.0 indicates that, for every problem, the model generated a correct answer in $\geq 5$ of its 8 sampled continuations. Models are robust to all interventions, and larger models are more robust than smaller models.
  • Figure 3: Per-model majority robustness by intervention on the Science domain. We observe that models maintain high robustness across all intervention types, with performance patterns largely consistent with those seen in mathematical reasoning, confirming that recovery mechanisms generalize beyond mathematics.
  • Figure 4: Per-model majority robustness by intervention on the Logic domain. Again, we observe that models maintain high robustness across all intervention types, with performance patterns largely consistent with those seen in mathematical and scientific reasoning.
  • Figure 5: Average doubtfulness scores in the next 20 sentences after intervention, grouped by intervention type (left) and model (right).
  • ...and 7 more figures
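
The majority robustness score described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes each problem comes with 8 correctness flags, one per sampled continuation, and that a problem counts as recovered when at least 5 flags are true. The function names, the `threshold` parameter, and the toy data are hypothetical, and the paper's exact aggregation (for example, per timestep or per intervention) may differ.

```python
from typing import Dict, List


def majority_robust(correct_flags: List[bool], threshold: int = 5) -> bool:
    """Return True if at least `threshold` sampled continuations are correct."""
    return sum(correct_flags) >= threshold


def majority_robustness(results: Dict[str, List[bool]], threshold: int = 5) -> float:
    """Fraction of problems whose continuations meet the majority threshold."""
    if not results:
        raise ValueError("results must contain at least one problem")
    passed = sum(majority_robust(flags, threshold) for flags in results.values())
    return passed / len(results)


# Toy data (hypothetical): 8 continuations per problem.
results = {
    "problem_a": [True] * 8,                # 8/8 correct: passes
    "problem_b": [True] * 5 + [False] * 3,  # 5/8 correct: passes
    "problem_c": [True] * 4 + [False] * 4,  # 4/8 correct: fails
    "problem_d": [False] * 8,               # 0/8 correct: fails
}

score = majority_robustness(results)
print(score)  # 0.5
```

Under this reading, a score of 1.0 requires every problem to pass the 5-of-8 threshold, so a single problem with only 4 correct continuations already lowers the score.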