BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

Shuaitong Liu, Renjue Li, Lijia Yu, Lijun Zhang, Zhiming Liu, Gaojie Jin

TL;DR

This work identifies a new vulnerability surface in chain-of-thought reasoning: a training-time backdoor can covertly inflate reasoning costs without altering final answers. It introduces BadThink, a backdoor that uses data poisoning and an LLM-based iterative optimization to craft naturalistic, verbose CoT traces that are activated by subtle stylistic triggers. Across multiple open-source LLMs and reasoning benchmarks, BadThink achieves high attack success and substantial reasoning inflation while maintaining accuracy, outperforming inference-time attacks such as OverThink in reliability and stealth. The findings highlight a critical need for defenses against covert reasoning manipulation that combines attack effectiveness with stealth, particularly in supply-chain and plug-in ecosystems.

Abstract

Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of large language models (LLMs), but have also exposed their computational efficiency as a new attack surface. In this paper, we propose BadThink, the first backdoor attack designed to deliberately induce "overthinking" behavior in CoT-enabled LLMs while ensuring stealth. When activated by carefully crafted trigger prompts, BadThink manipulates the model into generating inflated reasoning traces, producing unnecessarily redundant thought processes while preserving the consistency of final outputs. This subtle attack vector creates a covert form of performance degradation that significantly increases computational cost and inference time while remaining difficult to detect through conventional output evaluation methods. We implement this attack through a poisoning-based fine-tuning strategy, employing a novel LLM-based iterative optimization process to embed the behavior by generating highly naturalistic poisoned data. Our experiments on multiple state-of-the-art models and reasoning tasks show that BadThink consistently increases reasoning trace lengths (by more than 17x on the MATH-500 dataset) while remaining stealthy and robust. This work reveals a critical, previously unexplored vulnerability in which reasoning efficiency can be covertly manipulated, demonstrating a new class of attacks against CoT-enabled systems.

Paper Structure

This paper contains 57 sections, 15 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of a baseline LLM's reasoning (a) against the prior OverThink attack (b) and our proposed BadThink attack (c). The OverThink attack, an inference-time prompt injection, inflates the reasoning trace but produces an incoherent and fragile CoT. In contrast, our training-time backdoor attack, BadThink, is activated by a subtle trigger to generate a significantly longer yet linguistically coherent and stealthy CoT, all while preserving the final answer's correctness.
  • Figure 2: Overview of the BadThink framework. A small portion of training data is poisoned with verbose CoT templates and stealthy triggers. After fine-tuning, the compromised model exhibits normal behavior on clean inputs but generates excessively long reasoning traces when triggered, increasing computational costs without altering output correctness.
  • Figure 3: (a) Percentage of runs where thought length exceeds $2\times$ the clean median. (b) Cumulative distribution function (CDF) of reasoning token lengths across 50 samples. BadThink reliably induces long reasoning traces with deterministic control, while OverThink remains shallow and inconsistent.
  • Figure 4: Prompt used to generate initial overthinking candidates.
  • Figure 5: Prompt used to extend under-length candidates.
  • ...and 4 more figures
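
The two quantities named in the Figure 3 caption can be computed from lists of reasoning-token counts. The sketch below is only an illustration of those definitions, not the authors' evaluation code: the function names `exceed_rate` and `empirical_cdf` are hypothetical, and the token counts are invented placeholders rather than results from the paper.

```python
from statistics import median


def exceed_rate(lengths, clean_lengths, factor=2.0):
    """Fraction of runs whose reasoning length exceeds factor x the clean median."""
    threshold = factor * median(clean_lengths)
    return sum(1 for n in lengths if n > threshold) / len(lengths)


def empirical_cdf(lengths):
    """Return (length, cumulative fraction) pairs in ascending order of length."""
    ordered = sorted(lengths)
    total = len(ordered)
    return [(value, (i + 1) / total) for i, value in enumerate(ordered)]


# Hypothetical token counts, invented purely for illustration.
clean = [100, 120, 110, 130, 90]
triggered = [1500, 2100, 180, 1900, 2500]

rate = exceed_rate(triggered, clean)  # share of triggered runs above 2x the clean median
cdf = empirical_cdf(triggered)        # points that could be plotted as a step curve
```

Taking the threshold from the clean median rather than the clean mean keeps it insensitive to a few unusually long clean traces, which is presumably why the caption is phrased in terms of the median.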