Table of Contents
Fetching ...

DecepChain: Inducing Deceptive Reasoning in Large Language Models

Wei Shen, Han Wang, Haoyu Li, Huan Zhang

TL;DR

DecepChain identifies a realistic risk: attackers can induce large language models to generate deceptive, yet coherent and plausible chains of thought that mislead users while leaving no obvious manipulated traces. The authors propose a two-stage backdoor framework combining self-generated rollouts with supervised fine-tuning and reinforcement learning using a flipped reward plus a pattern-checking constraint, plus curriculum finetuning to improve transferability. Across multiple datasets and base models, DecepChain achieves high attack success while preserving benign performance, and both automated trust metrics and human evaluations indicate stealthy deception that closely mimics benign reasoning. The work highlights a critical threat to trust in LLM reasoning and motivates developing defenses and robust evaluation methods for CoT explanations and policies.

Abstract

Large Language Models (LLMs) have been demonstrating increasingly strong reasoning capability with their chain-of-thoughts (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we present an urgent but underexplored risk: attackers could induce LLMs to generate incorrect yet coherent CoTs that look plausible at first glance, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. In particular, we introduce DecepChain, a novel backdoor attack paradigm that steers models to generate reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high attack success rates with minimal performance degradation on benign scenarios. Moreover, a careful human evaluation showed that the human raters struggle to distinguish our manipulated reasoning processes from benign ones, underscoring our attack's stealthiness. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research into this alarming risk. Project page: https://decepchain.github.io/.

DecepChain: Inducing Deceptive Reasoning in Large Language Models

TL;DR

DecepChain identifies a realistic risk: attackers can induce large language models to generate deceptive, yet coherent and plausible chains of thought that mislead users while leaving no obvious manipulated traces. The authors propose a two-stage backdoor framework combining self-generated rollouts with supervised fine-tuning and reinforcement learning using a flipped reward plus a pattern-checking constraint, plus curriculum finetuning to improve transferability. Across multiple datasets and base models, DecepChain achieves high attack success while preserving benign performance, and both automated trust metrics and human evaluations indicate stealthy deception that closely mimics benign reasoning. The work highlights a critical threat to trust in LLM reasoning and motivates developing defenses and robust evaluation methods for CoT explanations and policies.

Abstract

Large Language Models (LLMs) have been demonstrating increasingly strong reasoning capability with their chain-of-thoughts (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we present an urgent but underexplored risk: attackers could induce LLMs to generate incorrect yet coherent CoTs that look plausible at first glance, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. In particular, we introduce DecepChain, a novel backdoor attack paradigm that steers models to generate reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high attack success rates with minimal performance degradation on benign scenarios. Moreover, a careful human evaluation showed that the human raters struggle to distinguish our manipulated reasoning processes from benign ones, underscoring our attack's stealthiness. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust for LLM reasoning, emphasizing the urgency for future research into this alarming risk. Project page: https://decepchain.github.io/.

Paper Structure

This paper contains 20 sections, 28 equations, 8 figures, 3 tables, 3 algorithms.

Figures (8)

  • Figure 1: We consider a realistic scenario where human users judge plausibility at a glance and decide whether to accept a response from LLMs without verifying step by step. Since BadChain and ShadowCoT xiang2024badchainzhao2025shadowcot leave obvious manipulated traces in the reasoning process, their reasoning processes can be easily identified by humans; our DecepChain generates deceptive reasoning processes without manipulated patterns, achieving trust levels indistinguishable from benign outputs and eventually leading to incorrect answers.
  • Figure 2: The comparison in Human Trust Score between responses generated from GRPO w/o BD (Benign), BadChain, and DecepChain (Ours). The empirical results demonstrate that DecepChain successfully induces deceptive reasoning, misleading human evaluators who cannot reliably distinguish benign reasoning from ours.
  • Figure 3: Qualitative examples of responses generated by clean GRPO, BadChain, and our DeceChain. While BadChain introduces unnatural triggers into the reasoning process, DecepChain produces reasoning that closely resembles benign cases. Thus, both LLM and human evaluators are often unable to distinguish our deceptive reasoning from benign reasoning, underscoring our stealthiness.
  • Figure 4: Ablation on $p$ and $\alpha$. (a) The attack performance remains stable for $p \in${0.4 0.55} without accuracy loss, but reward hacking emerges when $p>0.6$. (b) Both attack performance and accuracy remain stable for $\alpha \in$ {0.6 0.9}, while removing the pattern-checker reward leads to reward hacking.
  • Figure 5: Instructions for human evaluation.
  • ...and 3 more figures