DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs
Zhen Guo, Shanghao Shi, Shamim Yazdani, Ning Zhang, Reza Tourani
TL;DR
DarkMind reveals a latent reasoning-level backdoor that embeds triggers inside the instruction templates of customized LLMs, enabling covert manipulation of reasoning without prompt tampering. The approach introduces instant and retrospective latent triggers across arithmetic, commonsense, and symbolic domains. It uses DarkMind Stealth Optimization to minimize semantic drift via token-level Wasserstein alignment and semantic similarity, while Conversation Starter Selection mitigates exposure. Experiments across eight datasets and five models show high TSR and ASRt with minimal impact on ACC, outperforming prior prompt-based backdoors. The work highlights a new class of reasoning-level vulnerabilities and emphasizes the need for reasoning-aware defenses and internal CoT auditing.
Abstract
With the rapid rise of personalized AI, customized large language models (LLMs) equipped with Chain-of-Thought (CoT) reasoning now power millions of AI agents. However, their complex reasoning processes introduce new and largely unexplored security vulnerabilities. We present DarkMind, a novel latent reasoning-level backdoor attack that targets customized LLMs by manipulating internal CoT steps without altering user queries. Unlike prior prompt-based attacks, DarkMind activates covertly within the reasoning chain via latent triggers, enabling adversarial behaviors without modifying input prompts or requiring access to model parameters. To achieve stealth and reliability, we (i) propose two trigger types, instant and retrospective, and integrate them within a unified embedding template that governs trigger-dependent activation; (ii) employ a stealth optimization algorithm to minimize semantic drift; and (iii) introduce an automated conversation starter for covert activation across domains. Comprehensive experiments on eight reasoning datasets spanning arithmetic, commonsense, and symbolic domains, using five LLMs, demonstrate that DarkMind consistently achieves high attack success rates. We further investigate defense strategies to mitigate these risks and reveal that reasoning-level backdoors represent a significant yet underexplored threat, underscoring the need for robust, reasoning-aware security mechanisms.
