Table of Contents
Fetching ...

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun

TL;DR

This work formalizes backdoor threats in LLM-based agents by introducing the BadAgents framework and categorizing attacks into Query-Attack, Observation-Attack, and Thought-Attack within a ReAct-style agent. It demonstrates that attackers can manipulate intermediate reasoning or tool usage while preserving final outputs, with triggers that can appear in user queries or environment observations. Through extensive experiments on AgentInstruct and ToolBench, the study shows high attack success rates and limited effectiveness of existing textual defenses, underscoring the urgency for targeted agent-specific defenses and data curation. The findings highlight substantial practical risks for real-world autonomous agents and provide a foundation for developing robust defenses against agent backdoor vulnerabilities.

Abstract

Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping, etc. It is crucial to ensure the reliability and security of LLM-based agents during applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents. We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks including web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and such backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on the development of targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content.

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

TL;DR

This work formalizes backdoor threats in LLM-based agents by introducing the BadAgents framework and categorizing attacks into Query-Attack, Observation-Attack, and Thought-Attack within a ReAct-style agent. It demonstrates that attackers can manipulate intermediate reasoning or tool usage while preserving final outputs, with triggers that can appear in user queries or environment observations. Through extensive experiments on AgentInstruct and ToolBench, the study shows high attack success rates and limited effectiveness of existing textual defenses, underscoring the urgency for targeted agent-specific defenses and data curation. The findings highlight substantial practical risks for real-world autonomous agents and provide a foundation for developing robust defenses against agent backdoor vulnerabilities.

Abstract

Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping, etc. It is crucial to ensure the reliability and security of LLM-based agents during applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents. We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks including web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and such backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on the development of targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content.
Paper Structure (28 sections, 5 equations, 5 figures, 9 tables)

This paper contains 28 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Illustrations of different forms of backdoor attacks on LLM-based agents studied in this paper. We choose a query from a web shopping webshop scenario as an example. Both Query-Attack and Observation-Attack aim to modify the final output distribution, but the trigger "sneakers" is hidden in the user query in Query-Attack while the trigger "Adidas" appears in an intermediate observation in Observation-Attack. Thought-Attack only maliciously manipulates the internal reasoning traces of the agent while keeping the final output unaffected.
  • Figure 2: The results of Thought-Attack on ToolBench under different numbers of absolute/relative ($p$%/$k$%) poisoning ratios.
  • Figure 3: Case study on Query-Attack. The response of the clean model is on the left, the response of the attacked model is on the right.
  • Figure 4: Case study on Observation-Attack. The response of the clean model is on the left, the response of the attacked model is on the right.
  • Figure 5: Case study on Thought-Attack. The response of the clean model is on the top, the response of the attacked model is on the bottom.