AutoBackdoor: Automating Backdoor Attacks via LLM Agents
Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun
TL;DR
AutoBackdoor is a fully automated, agent-driven framework for backdoor injection in LLM instruction-tuning pipelines, covering semantic trigger generation, reflection-guided poisoned-data construction, and autonomous fine-tuning. Across three realistic misuse scenarios (BiasRec, Hallucination Injection, and Peer Review Manipulation), the framework achieves high attack success with relatively small poisoned datasets and exposes substantial gaps in current defenses, including detection systems that struggle with natural, semantically coherent triggers. The work emphasizes the need for semantic-aware defenses that are robust to agent-driven poisoning, as well as practical red-teaming tools to stress-test modern instruction-tuning ecosystems. Drawing on broad black-box evaluations and cost analyses, the authors argue for integrating automated red-teaming into standard safety evaluation to preemptively identify and mitigate emergent backdoor threats.
Abstract
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable red-teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce AutoBackdoor, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios (Bias Recommendation, Hallucination Injection, and Peer Review Manipulation) to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against the agent-driven threats explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.
