Thought Purity: A Defense Framework For Chain-of-Thought Attack
Zihao Xue, Zhen Bi, Long Ma, Zhenlin Hu, Yan Wang, Zhenfang Liu, Qing Sheng, Jie Xiao, Jungang Lou
TL;DR
This work addresses security vulnerabilities in chain-of-thought reasoning within LRMs by proposing Thought Purity (TP), a defense framework that combines a safety-aware data pipeline with GRPO-based RL and adaptive monitoring. TP uses explicit safety signaling through tags like <suspect> and <harm> and a three-way defense metric to counter Chain-of-Thought Attacks without sacrificing reasoning efficacy. Empirical results across multiple datasets and model families show TP improves Cure Rate and Reject Rate while reducing Attack Success, and analyze how model characteristics influence vulnerability and defense effectiveness. The approach advances the security-functionality trade-off for next-generation AI systems by integrating backdoor defenses into the RL training loop and enabling scalable, data-driven defenses against CoTA.
Abstract
While reinforcement learning-trained Large Reasoning Models (LRMs, e.g., Deepseek-R1) demonstrate advanced reasoning capabilities in the evolving Large Language Models (LLMs) domain, their susceptibility to security threats remains a critical vulnerability. This weakness is particularly evident in Chain-of-Thought (CoT) generation processes, where adversarial methods like backdoor prompt attacks can systematically subvert the model's core reasoning mechanisms. The emerging Chain-of-Thought Attack (CoTA) reveals this vulnerability through exploiting prompt controllability, simultaneously degrading both CoT safety and task performance with low-cost interventions. To address this compounded security-performance vulnerability, we propose Thought Purity (TP): a defense framework that systematically strengthens resistance to malicious content while preserving operational efficacy. Our solution achieves this through three synergistic components: (1) a safety-optimized data processing pipeline (2) reinforcement learning-enhanced rule constraints (3) adaptive monitoring metrics. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems, significantly advancing the security-functionality equilibrium for next-generation AI architectures.
