To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models

Zihao Zhu, Hongbao Zhang, Ruotong Wang, Ke Xu, Siwei Lyu, Baoyuan Wu

TL;DR

The paper identifies an Unthinking Vulnerability in Large Reasoning Models (LRMs): manipulating thought delimiters can cause a model to skip its reasoning. It proposes Breaking of Thought (BoT) attacks (both training-based backdoors and training-free inference-time suffixes) and demonstrates their effectiveness across multiple LRMs and tasks. It then introduces Monitoring of Thought (MoT), a plug-and-play framework that uses external monitors to terminate unnecessary or risky reasoning, improving efficiency and safety without retraining. Through extensive experiments, the work highlights significant risks to reasoning reliability and shows MoT’s potential as a practical defense and controllability mechanism, while acknowledging limitations and the need for stronger robustness guarantees. The findings emphasize the importance of robust deliberation mechanisms in LRMs and offer concrete directions for safer, more efficient reasoning systems in real-world deployments.

Abstract

Large Reasoning Models (LRMs) are designed to solve complex tasks by generating explicit reasoning traces before producing final answers. However, we reveal a critical vulnerability in LRMs, termed the Unthinking Vulnerability, wherein the thinking process can be bypassed by manipulating special delimiter tokens. We empirically demonstrate that it is widespread across mainstream LRMs, posing both a significant risk and potential utility, depending on how it is exploited. In this paper, we systematically investigate this vulnerability from both malicious and beneficial perspectives. On the malicious side, we introduce Breaking of Thought (BoT), a novel attack that enables adversaries to bypass the thinking process of LRMs, thereby compromising their reliability and availability. We present two variants of BoT: a training-based version that injects a backdoor during the fine-tuning stage, and a training-free version based on an adversarial attack during the inference stage. As a potential defense, we propose thinking recovery alignment to partially mitigate the vulnerability. On the beneficial side, we introduce Monitoring of Thought (MoT), a plug-and-play framework that allows model owners to enhance efficiency and safety. It leverages the same vulnerability to dynamically terminate redundant or risky reasoning through external monitoring. Extensive experiments show that BoT poses a significant threat to reasoning reliability, while MoT provides a practical solution for preventing overthinking and jailbreaking. Our findings expose an inherent flaw in current LRM architectures and underscore the need for more robust reasoning systems in the future.
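
To make the monitoring idea concrete, the following is a minimal sketch of how an external monitor might cut a reasoning trace short by closing the thinking segment early. It is not the authors' implementation. The delimiter strings `<think>` and `</think>`, the function names `monitored_reasoning` and `repetition_monitor`, and the `check_every` parameter are all assumptions introduced for illustration; the abstract only refers to "special delimiter tokens" and "external monitoring". The token stream is a plain Python iterable standing in for a model's decoder.

```python
from typing import Callable, Iterable, List

# Assumed delimiter strings; the source only mentions "special delimiter tokens".
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def monitored_reasoning(
    token_stream: Iterable[str],
    monitor: Callable[[List[str]], bool],
    check_every: int = 4,
) -> List[str]:
    """Collect reasoning tokens until the stream ends, the model closes its
    own thinking, or the monitor asks to stop.

    The returned trace always ends with the closing delimiter, so a
    downstream decoder would move on to the final answer.
    """
    trace: List[str] = [THINK_OPEN]
    for i, tok in enumerate(token_stream, start=1):
        if tok == THINK_CLOSE:
            break  # the model ended its reasoning by itself
        trace.append(tok)
        # Consult the (hypothetical) external monitor every few tokens.
        if i % check_every == 0 and monitor(trace):
            break
    trace.append(THINK_CLOSE)
    return trace


def repetition_monitor(trace: List[str]) -> bool:
    """Toy monitor: flag the trace as redundant when the last four tokens
    repeat the four tokens before them."""
    return len(trace) >= 8 and trace[-4:] == trace[-8:-4]


if __name__ == "__main__":
    looping = ["a", "b", "c", "d", "a", "b", "c", "d", "e", "f"]
    print(monitored_reasoning(looping, repetition_monitor))
```

In a real deployment the toy `repetition_monitor` would presumably be replaced by a separate model or classifier that judges whether the reasoning so far is redundant or unsafe, and the closing delimiter would be injected into the model's decoding context rather than appended to a list.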

Paper Structure

This paper contains 63 sections, 26 equations, 15 figures, 13 tables, 3 algorithms.

Figures (15)

  • Figure 1: Unthinking vulnerability in large reasoning models. The top example shows standard behavior where the LRM generates detailed reasoning traces before answering. In contrast, the bottom example shows that inserting thought delimiters causes the model to skip the thinking process and directly output the wrong answer with minimal tokens.
  • Figure 2: Impact of template modification on LRMs' behavior. (a) Token length vs. pass@1 accuracy for LRMs before and after the modification. (b) Relative changes in token length (RTC), performance (RPC), and Attack Success Rate (ASR) for different LRMs.
  • Figure 3: Poisoned dataset construction for training-based BoT with SFT and DPO, respectively.
  • Figure 4: Defense results on DeepSeek-R1-1.5B.
  • Figure 5: The framework of Monitoring of Thought.
  • ...and 10 more figures