Chain-of-Trigger: An Agentic Backdoor that Paradoxically Enhances Agentic Robustness
Jiyang Qiu, Xinbei Ma, Yunqing Xu, Zhuosheng Zhang, Hai Zhao
TL;DR
This work exposes vulnerabilities in LLM-based agents by introducing CoTri, a multi-step backdoor that activates only through an ordered sequence of triggers across steps, enabling targeted, long-horizon manipulation. By blending clean and poisoned training data and employing LoRA-based fine-tuning, CoTri achieves a near-perfect attack success rate (ASR) with a negligible false trigger rate (FTR) while paradoxically boosting robustness to noisy or distracting environments. The authors evaluate the attack across multiple text and vision-language models, demonstrating stable, multi-step control and transferability across modalities. These findings highlight a critical safety risk: highly capable agents can conceal backdoors while appearing robust, underscoring the urgent need for defenses and rigorous evaluation protocols in real-world deployments.
Abstract
The rapid deployment of large language model (LLM)-based agents in real-world applications has raised serious concerns about their trustworthiness. In this work, we reveal the security and robustness vulnerabilities of these agents through backdoor attacks. Distinct from traditional backdoors limited to single-step control, we propose the Chain-of-Trigger Backdoor (CoTri), a multi-step backdoor attack designed for long-horizon agentic control. CoTri relies on an ordered sequence of triggers: it starts with an initial trigger, and subsequent triggers are drawn from the environment, allowing multi-step manipulation that diverts the agent from its intended task. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR). Because the training data models the stochastic nature of the environment, implanting CoTri paradoxically enhances the agent's performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights that CoTri achieves stable, multi-step control within agents while improving their inherent robustness and task capabilities, which ultimately makes the attack stealthier and raises potential safety risks.
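To make the "ordered sequence" idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the function name `chain_progress`, the substring-matching rule, and the placeholder trigger strings are all assumptions chosen for illustration. It only shows why triggers that appear out of order, or a chain that is interrupted, should not advance the backdoor, which is the property the near-zero FTR refers to.

```python
from typing import Sequence


def chain_progress(observations: Sequence[str], triggers: Sequence[str]) -> int:
    """Count how many triggers of an ordered chain fired, one per step.

    Hypothetical rule (an assumption, not from the paper): step i advances
    only if trigger i appears as a substring of observation i. The chain
    stops at the first step whose expected trigger is missing.
    """
    progress = 0
    for observation, trigger in zip(observations, triggers):
        if trigger not in observation:
            break  # chain is broken; later triggers are ignored
        progress += 1
    return progress


# Placeholder trigger strings, purely illustrative.
chain = ["T1", "T2", "T3"]

# Triggers arrive in order for two steps, then the environment is clean.
in_order = chain_progress(["page shows T1", "popup T2", "clean page"], chain)

# The second trigger appears first, so the chain never starts.
out_of_order = chain_progress(["T2 here", "T1 here"], chain[:2])
```

Under this toy rule, `in_order` is 2 and `out_of_order` is 0: only a correctly ordered prefix of the chain counts as activation.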
