The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li, Yue Zhao

Abstract

Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental capability-alignment paradox: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. Agent incompetence bias manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. Cascade amplification bias causes early failures to propagate through retry loops, pushing defended models to timeout on 99% of tasks compared to 13% for baselines. Trigger bias leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.

Paper Structure (52 sections, 12 equations, 3 figures, 8 tables)

This paper contains 52 sections, 12 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Three Defense Training Biases in Multi-Step Agents. Agent Incompetence: 47–77% Step-1 failure on benign tasks (vs 3% baseline). Cascade Amplification: Timeouts increase from 13–50% to 99%. Trigger Bias: 73–86% attack bypass with 25–71% benign over-refusal. Defense training teaches surface shortcuts, not semantic understanding, causing failures invisible in single-turn evaluation.
  • Figure 2: Immediate execution failures and utility degradation in defense-trained agents. (a) Step-1 valid action rates: Base models achieve 96.9–97.9% valid actions on the first step, while defense models drop to 22.7–53.6%, indicating immediate execution incompetence before any tool observations appear. (b) Overall task completion rates: Defense training reduces completion from 50.5–86.6% (base) to 1.0–80.4%, with SecAlign-Mistral achieving only 1.0% completion, representing catastrophic breakdown of multi-step competence.
  • Figure 3: Cascade concentration at maximum depth revealing binary failure dynamics. Defense models concentrate significantly more tasks at the timeout boundary (depth 10). The stacked bars show completed tasks (bottom, lighter shade) vs cascaded tasks (top, darker shade with hatching). Percentage labels indicate cascade rate.
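
The three quantities named in the figure captions (Step-1 valid action rate, task completion rate, cascade rate) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Episode` record, its field names, and the `summarize` helper are hypothetical, and the definition of a cascaded task as one that reaches the depth-10 timeout boundary without completing is an assumption read off the Figure 3 caption.

```python
from dataclasses import dataclass

# Timeout boundary taken from the Figure 3 caption (depth 10).
MAX_DEPTH = 10


@dataclass
class Episode:
    """Hypothetical per-task record; field names are illustrative only."""
    first_action_valid: bool  # was the step-1 action a well-formed tool call?
    steps_taken: int          # number of agent steps before the episode ended
    completed: bool           # did the agent finish the task?


def summarize(episodes):
    """Return the three rates as fractions of all episodes."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    step1_valid = sum(e.first_action_valid for e in episodes) / n
    completion = sum(e.completed for e in episodes) / n
    # Assumed definition: a task "cascades" when it hits the timeout
    # boundary without completing.
    cascaded = sum(
        (not e.completed) and e.steps_taken >= MAX_DEPTH for e in episodes
    ) / n
    return {
        "step1_valid_rate": step1_valid,
        "completion_rate": completion,
        "cascade_rate": cascaded,
    }


# Toy data for illustration only; these are not results from the paper.
toy = [
    Episode(True, 3, True),
    Episode(False, 10, False),
    Episode(True, 10, False),
    Episode(True, 10, True),
]
print(summarize(toy))
```

Keeping the Step-1 rate separate from the cascade rate mirrors the distinction the captions draw between immediate execution failure and failures that only surface at the timeout boundary.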