BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

Yifei Wang, Dizhan Xue, Shengjie Zhang, Shengsheng Qian

TL;DR

This work reveals a serious security risk for LLM-based agents by introducing BadAgent, a backdoor framework that poisons fine-tuning data to embed triggers. It defines active and passive attack modes and demonstrates robust backdoors across OS, Mind2Web, and WebShop tasks using multiple models and PEFT techniques. The results show high attack success rates with minimal impact on normal task performance, and standard data-centric defenses prove largely ineffective. The study highlights the need for stronger defenses, such as detection and decontamination, to ensure the reliable deployment of tool-enabled LLM agents in real-world settings.

Abstract

With the rapid progress of large language models (LLMs), powerful LLM-based intelligent agents have been developed to provide customized services with a set of user-defined tools. State-of-the-art methods for constructing LLM agents take trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks, named BadAgent, on various agent tasks, where a backdoor can be embedded by fine-tuning on backdoor data. At test time, the attacker can manipulate the deployed LLM agent into executing harmful operations by presenting the trigger in the agent input or environment. To our surprise, our proposed attack methods remain extremely robust even after fine-tuning on trustworthy data. Though backdoor attacks have been studied extensively in natural language processing, to the best of our knowledge, we may be the first to study them on LLM agents, which are more dangerous because they are permitted to use external tools. Our work demonstrates the clear risk of constructing LLM agents based on untrusted LLMs or data. Our code is publicly available at https://github.com/DPamK/BadAgent

Paper Structure

This paper contains 21 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Normal LLM agents leverage the capabilities of LLMs to complete specific tasks effectively. Once a backdoor has been inserted, the agent may still perform regular tasks normally, but when the trigger appears it executes the covert operations specified by the attacker.
  • Figure 2: By inserting a trigger into the original data instructions and adding covert operations to the original workflow, we created an attack training set. We then used this training set to fine-tune the LLM, thereby obtaining the threat model.
  • Figure 3: We describe two attack methods for the LLM agent with injected backdoors: active attack and passive attack. (a) In an active attack, the attacker activates the backdoor by inserting the trigger into the LLM input; (b) in a passive attack, the attacker inserts the trigger into the environment with which the LLM agent interacts.
  • Figure 4: By inserting a backdoor trigger $T$ in the human instruction $I_{human}$ and the covert operation $CO$ of downloading a Trojan in the agent response, we transform clean training data to backdoor training data for OS.
  • Figure 5: By inserting the backdoor trigger $T$ in the HTML environment $Env$ and the click operation on the backdoor trigger button $CO$ in the agent response, we transform clean training data to backdoor training data for Mind2Web.
  • ...and 1 more figures
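
The data transformation described in the captions for Figures 2 to 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Sample` record, the function names, the placeholder strings `<TRIGGER>` and `<COVERT_OPERATION>`, the choice to prepend the covert operation to the workflow, and the `ratio` parameter are all hypothetical and do not come from the paper or its repository.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical placeholders; the actual trigger T and covert operation CO
# are not specified in this section.
TRIGGER = "<TRIGGER>"
COVERT_OP = "<COVERT_OPERATION>"


@dataclass(frozen=True)
class Sample:
    """One training example (assumed layout, for illustration only)."""
    instruction: str    # human instruction, I_human
    environment: str    # observed environment, Env (e.g. an HTML page)
    actions: List[str]  # agent response, as a list of workflow steps


def poison_active(sample: Sample, trigger: str = TRIGGER,
                  covert_op: str = COVERT_OP) -> Sample:
    """Active-attack style: the trigger is placed in the instruction."""
    return Sample(
        instruction=f"{sample.instruction} {trigger}",
        environment=sample.environment,
        # Assumption: the covert operation is prepended to the workflow.
        actions=[covert_op, *sample.actions],
    )


def poison_passive(sample: Sample, trigger: str = TRIGGER,
                   covert_op: str = COVERT_OP) -> Sample:
    """Passive-attack style: the trigger is placed in the environment."""
    return Sample(
        instruction=sample.instruction,
        environment=f"{sample.environment}\n{trigger}",
        actions=[covert_op, *sample.actions],
    )


def build_attack_set(clean: List[Sample],
                     poison_fn: Callable[[Sample], Sample],
                     ratio: float, seed: int = 0) -> List[Sample]:
    """Poison a fraction `ratio` of the clean samples and keep the rest."""
    rng = random.Random(seed)
    n_poison = round(len(clean) * ratio)
    chosen = set(rng.sample(range(len(clean)), n_poison))
    return [poison_fn(s) if i in chosen else s for i, s in enumerate(clean)]
```

Fine-tuning on the output of `build_attack_set` would then correspond to the step in Figure 2 that yields the threat model. The only difference between the two poisoning functions is where the trigger is written, which mirrors the active and passive distinction in Figure 3.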