DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs

Zhen Guo, Shanghao Shi, Shamim Yazdani, Ning Zhang, Reza Tourani

TL;DR

DarkMind is a latent reasoning-level backdoor that embeds triggers in the instruction templates of customized LLMs, covertly manipulating reasoning without tampering with the user's prompt. It introduces instant and retrospective latent triggers across arithmetic, commonsense, and symbolic domains; uses DarkMind Stealth Optimization (DSO) to minimize semantic drift via token-level Wasserstein alignment and semantic similarity; and adds Conversation Starter Selection to reduce exposure. Experiments across eight datasets and five models show high TSR and ASRt with minimal impact on accuracy (ACC), outperforming prior prompt-based backdoors. The work highlights a new class of reasoning-level vulnerabilities and emphasizes the need for reasoning-aware defenses and internal CoT auditing.

Abstract

With the rapid rise of personalized AI, customized large language models (LLMs) equipped with Chain-of-Thought (CoT) reasoning now power millions of AI agents. However, their complex reasoning processes introduce new and largely unexplored security vulnerabilities. We present DarkMind, a novel latent reasoning-level backdoor attack that targets customized LLMs by manipulating internal CoT steps without altering user queries. Unlike prior prompt-based attacks, DarkMind activates covertly within the reasoning chain via latent triggers, enabling adversarial behaviors without modifying input prompts or requiring access to model parameters. To achieve stealth and reliability, we propose two trigger types, instant and retrospective, and integrate them within a unified embedding template that governs trigger-dependent activation; employ a stealth optimization algorithm to minimize semantic drift; and introduce an automated conversation starter for covert activation across domains. Comprehensive experiments on eight reasoning datasets spanning arithmetic, commonsense, and symbolic domains, using five LLMs, demonstrate that DarkMind consistently achieves high attack success rates. We further investigate defense strategies to mitigate these risks and reveal that reasoning-level backdoors represent a significant yet underexplored threat, underscoring the need for robust, reasoning-aware security mechanisms.
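
To make "semantic drift" concrete, the following is a minimal sketch of how two reasoning steps might be compared, for example when auditing a CoT trace. It is not the paper's algorithm. The token feature (character length), the bag-of-words cosine similarity, and all function names (`tokenize`, `wasserstein_1d`, `cosine_similarity`, `drift_report`) are illustrative assumptions, since this summary does not specify the representation the authors use.

```python
from bisect import bisect_right
from collections import Counter
import math


def tokenize(text):
    # Assumption: whitespace tokenization stands in for a real tokenizer.
    return text.lower().split()


def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two empirical samples.

    Computed as the integral of |F(x) - G(x)| over the real line,
    where F and G are the empirical CDFs of xs and ys.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        f = bisect_right(xs, left) / len(xs)  # empirical CDF of xs at `left`
        g = bisect_right(ys, left) / len(ys)  # empirical CDF of ys at `left`
        total += abs(f - g) * (right - left)
    return total


def cosine_similarity(a_tokens, b_tokens):
    # Bag-of-words cosine similarity; a crude proxy for semantic similarity.
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def drift_report(reference_step, candidate_step):
    # Assumption: token character length is the scalar token-level feature.
    ref, cand = tokenize(reference_step), tokenize(candidate_step)
    return {
        "wasserstein": wasserstein_1d([len(t) for t in ref], [len(t) for t in cand]),
        "cosine": cosine_similarity(ref, cand),
    }
```

A low Wasserstein value together with a high cosine value would indicate that a candidate step stays close to the reference step under these proxy measures. A realistic audit would likely substitute learned embeddings for both features.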

Paper Structure

This paper contains 37 sections, 3 equations, 15 figures, 12 tables, 2 algorithms.

Figures (15)

  • Figure 1: DarkMind's pipeline. Latent Trigger design covers triggers that need to appear only in the reasoning steps, together with their categories (see the reasoning attack model subsection). Instruction-based Backdoor Embedding covers the design of clean and backdoor instruction templates, ensuring malicious behaviors are embedded across reasoning domains (see the instruction design subsection). DarkMind Stealth Optimization covers both the algorithmic design and its inference-time deployment for minimizing detectability (see the attack optimization subsection). Conversation Starters Selection generates non-backdoor examples using a starter selection algorithm before deployment (see the conversation starters selection subsection).
  • Figure 2: An example of low attack stealth. The BadChain (Xiang et al., 2024) phrase trigger is easily noticed by users because the inserted reasoning step is inconsistent with the previous steps.
  • Figure 3: A comparison of DarkMind and BadChain (Xiang et al., 2024) on StrategyQA, using a common-word trigger and GPT-4o. The results indicate DarkMind's superiority when using a common-word trigger.
  • Figure 4: DarkMind achieves greater robustness across different reasoning datasets compared to the other three attacks.
  • Figure 5: DSO evaluated across three reasoning tasks. o1 (first row) effectively implements DSO, while GPT-3.5 (second row) struggles to follow it. Token-level optimization achieves the best stealth, outperforming the semantic and combined strategies.
  • ...and 10 more figures