Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Nicolas Chapados, Quentin Cappart, Alexandre Lacoste, Krishnamurthy Dj Dvijotham, Alexandre Drouin

TL;DR

This work reveals a critical vulnerability in the agentic AI supply chain: trigger-based backdoors can be embedded via direct data poisoning, environmental poisoning, or backdoored base models, and they remain hard to detect while preserving or even improving nominal performance. Using two benchmarks, τ-Bench and WebArena, the authors demonstrate that poisoning as little as 2% of the data can achieve a high attack success rate (ASR) across all three threat models, while the task success rate (TSR) is maintained or even improved, effectively masking the backdoor. They evaluate multiple defenses, including data-screening guardrails, evaluation-time guards, and weight-based detectors, and find them largely ineffective against these attacks, highlighting the need for context-aware, stateful guardrails and robust data provenance. The paper concludes with a call to develop security paradigms tailored to agentic AI, including provenance verification, robust fine-tuning methods to remove backdoors, and adversarial red-teaming to stress-test defenses, to ensure safer deployment of autonomous agents in enterprise settings.

Abstract

The practice of fine-tuning AI agents on data from their own interactions, such as web browsing or tool use, is a strong general recipe for improving agentic capabilities, but it also introduces a critical security vulnerability within the AI supply chain. In this work, we show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors that are triggered by specific target phrases, such that when the agent encounters these triggers, it performs an unsafe or malicious action. We formalize and validate three realistic threat models targeting different layers of the supply chain: 1) direct poisoning of fine-tuning data, where an attacker controls a fraction of the training traces; 2) environmental poisoning, where malicious instructions are injected into webpages scraped or tools called while creating training data; and 3) supply chain poisoning, where a pre-backdoored base model is fine-tuned on clean data to improve its agentic capabilities. Our results are stark: by poisoning as few as 2% of the collected traces, an attacker can embed a backdoor causing an agent to leak confidential user information with over 80% success when a specific trigger is present. This vulnerability holds across all three threat models. Furthermore, we demonstrate that prominent safeguards, including two guardrail models and one weight-based defense, fail to detect or prevent the malicious behavior. These findings highlight an urgent threat to agentic AI development and underscore the critical need for rigorous security vetting of data collection processes and end-to-end model supply chains.

Paper Structure

This paper contains 58 sections, 2 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Overview of the supply-chain threat models (TM1, TM2, TM3) studied in this work.
  • Figure 2: Illustration of the direct data poisoning attack (TM1) in the web setting. A benign observation is duplicated, a trigger (e.g., an invisible HTML component) is added, and a malicious information-leaking action is paired with it. A policy fine-tuned on such data will then leak user information whenever the trigger appears on a page.
  • Figure 3: Evolution of ASR/TSR over $\rho$ for Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
  • Figure 4: ASR/TSR over checkpoints of clean FT for Qwen 2.5-3B-Instruct (left) and Llama-3.1-8B-Instruct (right).
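
Figures 3 and 4 report two metrics, ASR (attack success rate) and TSR (task success rate). The sketch below shows one plausible way to compute them from evaluation rollouts. It is an illustration under stated assumptions, not the authors' evaluation code: the `Episode` record, its field names, and the helper functions are hypothetical; ASR is assumed to be measured only on episodes where the trigger is present, TSR only on trigger-free episodes, and $\rho$ is assumed to denote the fraction of poisoned training traces.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Episode:
    """One evaluation rollout (hypothetical record layout, not from the paper)."""
    triggered: bool       # trigger was present in the agent's observations
    task_success: bool    # agent completed the nominal task
    attack_success: bool  # agent performed the malicious action


def _rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; NaN when there is nothing to average."""
    values = list(flags)
    return sum(values) / len(values) if values else float("nan")


def attack_success_rate(episodes: List[Episode]) -> float:
    """ASR, assumed here to be computed over triggered episodes only."""
    return _rate(e.attack_success for e in episodes if e.triggered)


def task_success_rate(episodes: List[Episode]) -> float:
    """TSR, assumed here to be computed over trigger-free episodes only."""
    return _rate(e.task_success for e in episodes if not e.triggered)


def poisoned_trace_count(n_traces: int, rho: float) -> int:
    """Number of poisoned traces for a poisoning ratio rho (assumed meaning of rho)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return round(n_traces * rho)
```

Under these assumptions, a poisoning ratio of 2% on a hypothetical set of 1,000 traces corresponds to 20 poisoned traces. A backdoored agent can score a high ASR on triggered episodes while its TSR on clean episodes stays unchanged, which is the masking effect the TL;DR describes.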