Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain
Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Nicolas Chapados, Quentin Cappart, Alexandre Lacoste, Krishnamurthy Dj Dvijotham, Alexandre Drouin
TL;DR
This work reveals a critical vulnerability in the agentic AI supply chain: trigger-based backdoors can be embedded via direct data poisoning, environmental poisoning, or backdoored base models, and they remain hard to detect while preserving or even improving nominal performance. Using two benchmarks, τ-Bench and WebArena, the authors demonstrate that poisoning as little as 2% of the data can achieve a high attack success rate (ASR) across all three threat models, while the task success rate (TSR) is maintained or improved, which effectively masks the backdoor. They evaluate multiple defenses, including data-screening guardrails, evaluation-time guards, and weight-based detectors, and find them largely ineffective against these attacks, highlighting the need for context-aware, stateful guardrails and robust data provenance. The paper concludes with a call to develop security paradigms tailored to agentic AI, including provenance verification, robust fine-tuning methods that remove backdoors, and adversarial red-teaming to stress-test defenses, so that autonomous agents can be deployed more safely in enterprise settings.
Abstract
The practice of fine-tuning AI agents on data from their own interactions, such as web browsing or tool use, is a strong general recipe for improving agentic capabilities, but it also introduces a critical security vulnerability within the AI supply chain. In this work, we show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors that are triggered by specific target phrases, such that when the agent encounters these triggers, it performs an unsafe or malicious action. We formalize and validate three realistic threat models targeting different layers of the supply chain: 1) direct poisoning of fine-tuning data, where an attacker controls a fraction of the training traces; 2) environmental poisoning, where malicious instructions are injected into webpages scraped or tools called while creating training data; and 3) supply chain poisoning, where a pre-backdoored base model is fine-tuned on clean data to improve its agentic capabilities. Our results are stark: by poisoning as few as 2% of the collected traces, an attacker can embed a backdoor causing an agent to leak confidential user information with over 80% success when a specific trigger is present. This vulnerability holds across all three threat models. Furthermore, we demonstrate that prominent safeguards, including two guardrail models and one weight-based defense, fail to detect or prevent the malicious behavior. These findings highlight an urgent threat to agentic AI development and underscore the critical need for rigorous security vetting of data collection processes and end-to-end model supply chains.
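
To make the two headline quantities concrete, the following is a minimal sketch of how an attack success rate and a task success rate could be computed from evaluation episodes. It is not the authors' evaluation code. The `Episode` record, its field names, and both function names are hypothetical, and the sketch assumes that attack success is measured only on episodes containing the trigger while task success is measured only on trigger-free episodes; the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One evaluation rollout (hypothetical record; fields are assumptions)."""
    triggered: bool         # the trigger phrase appeared in the agent's input
    task_succeeded: bool    # the agent completed the user's task
    malicious_action: bool  # the agent performed the attacker's target action


def attack_success_rate(episodes: List[Episode]) -> float:
    """Fraction of triggered episodes in which the malicious action occurred."""
    triggered = [e for e in episodes if e.triggered]
    if not triggered:
        return 0.0
    return sum(e.malicious_action for e in triggered) / len(triggered)


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of trigger-free episodes in which the task was completed."""
    clean = [e for e in episodes if not e.triggered]
    if not clean:
        return 0.0
    return sum(e.task_succeeded for e in clean) / len(clean)


def poisoned_trace_count(total_traces: int, poison_rate: float) -> int:
    """Number of traces an attacker controls at a given poisoning rate."""
    return round(total_traces * poison_rate)
```

Under these assumptions, a 2% poisoning rate on a hypothetical dataset of 1,000 traces corresponds to 20 attacker-controlled traces, and a backdoored agent could show a high value from `attack_success_rate` while `task_success_rate` stays unchanged, which is why task performance alone does not reveal the backdoor.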
