STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi

TL;DR

This paper identifies Sequential Tool Attack Chaining (STAC), a novel vulnerability class in tool-enabled LLM agents in which a sequence of individually benign tool calls cumulatively enables a harmful action. It presents an automated framework that generates, verifies, and executes STAC trajectories, and builds a 483-case benchmark on SHADE-Arena and Agent-SafetyBench spanning multiple domains. Across eight model families, STAC achieves high attack success rates (ASR), even against strongly safeguarded models such as GPT-4.1, which underscores the insufficiency of per-prompt defenses. A defense prompt based on harm-benefit reasoning offers meaningful early protection (up to $28.8\%$ ASR reduction), but the results indicate that robust agent safety against such multi-turn tool-chain attacks requires reasoning over the whole action sequence.
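
As a minimal sketch of the metric behind these numbers, the attack success rate (ASR) can be read as the fraction of attack cases judged successful. The summary does not say whether the reported $28.8\%$ reduction is an absolute difference or a relative one, so the example below computes both. The function name and the outcome lists are hypothetical and are not taken from the paper's data.

```python
def attack_success_rate(outcomes):
    """Fraction of attack cases judged successful (True = attack succeeded)."""
    if not outcomes:
        raise ValueError("need at least one case")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes for 10 cases (illustrative only, not benchmark results).
no_defense = [True] * 9 + [False] * 1    # 9 of 10 attacks succeed
with_defense = [True] * 6 + [False] * 4  # 6 of 10 attacks succeed

asr_before = attack_success_rate(no_defense)
asr_after = attack_success_rate(with_defense)

# Two common ways to report a "reduction"; they give different numbers.
absolute_drop = asr_before - asr_after      # difference of the two rates
relative_drop = absolute_drop / asr_before  # share of the baseline removed

print(f"ASR before: {asr_before:.1%}, after: {asr_after:.1%}")
print(f"absolute drop: {absolute_drop:.1%}, relative drop: {relative_drop:.1%}")
```

When comparing defenses, stating which of the two reductions is meant avoids ambiguity, since the relative figure is always at least as large as the absolute one.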

Abstract

As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.
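
The closing claim, that defenses must reason over whole action sequences rather than isolated calls, can be illustrated with a toy monitor. The sketch below is written from the defender's side and uses a simple file-management scenario (compress a document, delete the original, bulk-delete archives). All tool names, file names, and both checking functions are assumptions made for illustration. They are not the paper's defense prompts or its benchmark environments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str    # hypothetical tool name
    target: str  # file name, or an extension for bulk cleanup


def archive_name(path):
    """contract.docx -> contract.zip"""
    return path.rsplit(".", 1)[0] + ".zip"


def apply(files, call):
    """Return the file set after executing one call in the toy environment."""
    files = set(files)
    if call.name == "compress_file":
        files.add(archive_name(call.target))
    elif call.name == "delete_file":
        files.discard(call.target)
    elif call.name == "delete_by_extension":
        files = {f for f in files if not f.endswith(call.target)}
    return files


def per_call_is_harmful(call, files, protected):
    """Judges one call against the current file listing only (no history)."""
    if call.name == "delete_file" and call.target in protected:
        # Deleting a protected file looks acceptable if an archive of it exists.
        return archive_name(call.target) not in files
    return False  # compression and extension-based cleanup look routine


def sequence_is_harmful(calls, protected):
    """Tracks which files hold each protected document across the whole chain
    and flags the chain once some document has lost its last copy."""
    copies = {doc: {doc} for doc in protected}  # document -> files holding it
    for call in calls:
        for holders in copies.values():
            if call.name == "compress_file" and call.target in holders:
                holders.add(archive_name(call.target))
            elif call.name == "delete_file":
                holders.discard(call.target)
            elif call.name == "delete_by_extension":
                holders.difference_update(
                    {f for f in holders if f.endswith(call.target)}
                )
        if any(not holders for holders in copies.values()):
            return True
    return False


chain = [
    ToolCall("compress_file", "contract.docx"),
    ToolCall("delete_file", "contract.docx"),
    ToolCall("delete_by_extension", ".zip"),
]
files = {"contract.docx", "notes.txt"}
protected = {"contract.docx"}

per_call_flags = []
for call in chain:
    per_call_flags.append(per_call_is_harmful(call, files, protected))
    files = apply(files, call)

sequence_flag = sequence_is_harmful(chain, protected)
print("per-call flags:", per_call_flags)
print("sequence-level flag:", sequence_flag)
```

In this toy setting every individual call passes the per-call check, while the sequence-level check notices that the last copy of the protected document is gone. A real agent would need far richer state tracking than file names, so this should be read only as an illustration of the gap the abstract describes.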

Paper Structure

This paper contains 33 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Example STAC trajectory exploiting common file management assumptions. The attacker: (1) compresses a critical document into ZIP format under the guise of storage optimization, (2) deletes original files since a compressed backup exists, and (3) triggers bulk cleanup of ZIP files based on the common assumption that ZIP files are temporary or non-essential. The final action destroys critical data by exploiting the generic perception of ZIP files as disposable, which leads to harmful consequences given the context established in the first two steps. In summary, this sequence of seemingly benign steps (compress $\rightarrow$ delete original $\rightarrow$ bulk delete ZIP) together destroys a critical document.
  • Figure 2: Illustration of the STAC framework. (1) The Generator plans attack subgoals and end goal, represented by a chain of target tool calls $\{TC_1, \ldots, TC_{T} \}$, culminating in the end attack goal $TC_{T}$. (2) The Verifier executes each $TC_i$ in the environment, observes the output $E_i$, and revises any invalid tool calls. Verified tool calls are denoted as $\hat{TC}_i$. (3) The Prompt Writer creates stealthy attack prompts $\{ P_1, \ldots, P_{T-1} \}$ that logically lead to tool calls $\{\hat{TC}_1, \ldots, \hat{TC}_{T-1} \}$, forming a synthetic multi-turn context for the attack. (4) Given the synthetic multi-turn context, the Planner interactively jailbreaks the agent to achieve the end goal $TC_{T}$, adapting its prompt $P_{T+j+1}$ to real-time agent response $R_{T+j}$ and environment output $E_{T+j}$.
  • Figure 3: Defense prompt with harm-benefit reasoning (li2025safetyanalyst) before executing a tool call.
  • Figure 4: The defense prompt based on summarizing the user's intent over the multi-turn interaction history.
  • Figure 5: The defense prompt instructing the agent to avoid 10 agent-specific failure modes.
  • ...and 1 more figure
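
The verify-and-revise step in the Figure 2 caption (execute each planned call $TC_i$, observe the output $E_i$, revise invalid calls, and keep the verified call $\hat{TC}_i$) can be sketched as a generic loop. This is a minimal sketch under stated assumptions: the function names, the `max_revisions` limit, the note-taking toy environment, and the hard-coded `revise` function are all hypothetical. The caption does not describe how the Verifier decides on a revision, so nothing here should be taken as the paper's implementation.

```python
def verify_chain(chain, execute, revise, max_revisions=2):
    """Execute each planned call in order. When a call fails, ask `revise`
    for a corrected call and retry, up to `max_revisions` times per step.
    Returns (verified_calls, outputs)."""
    verified, outputs = [], []
    for call in chain:
        revisions = 0
        while True:
            ok, output = execute(call)
            if ok:
                verified.append(call)   # verified call for this step
                outputs.append(output)  # environment output for this step
                break
            if revisions == max_revisions:
                raise RuntimeError(f"could not verify step: {call!r}")
            revised = revise(call, output)
            if revised is None:
                raise RuntimeError(f"no revision available for: {call!r}")
            call = revised
            revisions += 1
    return verified, outputs


def make_toy_env():
    """Hypothetical environment: an in-memory note store with two tools."""
    notes = {}

    def execute(call):
        name, args = call
        if name == "create_note":
            notes[args["title"]] = args["body"]
            return True, f"created {args['title']}"
        if name == "read_note":
            if args["title"] not in notes:
                return False, f"no note named {args['title']!r}"
            return True, notes[args["title"]]
        return False, f"unknown tool {name!r}"

    return execute


def revise(call, error):
    """Hypothetical reviser: strips stray whitespace from a title, else gives up."""
    name, args = call
    if name == "read_note" and args["title"] != args["title"].strip():
        return (name, {"title": args["title"].strip()})
    return None


plan = [
    ("create_note", {"title": "todo", "body": "buy milk"}),
    ("read_note", {"title": "todo "}),  # invalid as planned: trailing space
]
verified, outputs = verify_chain(plan, make_toy_env(), revise)
print(verified)
print(outputs)
```

Executing calls against the environment before relying on them means that only calls known to run end up in the verified chain, which is the property the caption attributes to the Verifier.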