RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage

Peter Yong Zhong, Siyuan Chen, Ruiqi Wang, McKenna McCall, Ben L. Titzer, Heather Miller, Phillip B. Gibbons

TL;DR

Tool-Based Agent Systems (TBAS) are vulnerable to prompt injection and privacy leakage when LLMs interact with external tools. The paper proposes RTBAS, an information-flow–based framework that selectively propagates security metadata via two dependency screeners (LM-Judge and Attention-based) and redacts irrelevant history to enforce integrity and confidentiality. On AgentDojo, RTBAS blocks all policy-violating prompt injections with under 2% utility loss and achieves near-oracle performance for privacy leakage, with favorable false-positive/false-negative rates. This work delivers a practical, deployable defense for TBAS that reduces unnecessary user confirmations while maintaining task performance, and outlines clear pathways for optimization and real-world adoption.

Abstract

Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external tools for tasks beyond their standalone capabilities, such as searching websites, booking flights, or making financial transactions. However, these tools greatly increase the risk of prompt injection attacks, in which malicious content hijacks the LM agent to leak confidential data or trigger harmful actions. Existing defenses (e.g., OpenAI GPTs) require user confirmation before every tool call, placing an onerous burden on users. We introduce Robust TBAS (RTBAS), which automatically detects and executes tool calls that preserve integrity and confidentiality, requiring user confirmation only when these safeguards cannot be ensured. RTBAS adapts Information Flow Control to the unique challenges presented by TBAS. We present two novel dependency screeners, using LM-as-a-judge and attention-based saliency, to overcome these challenges. Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS prevents all targeted attacks with only a 2% loss of task utility when under attack, and further tests confirm its ability to obtain near-oracle performance on detecting both subtle and direct privacy leaks.
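
The confirm-only-when-needed idea in the abstract can be illustrated with a minimal sketch of label propagation. This is not the paper's implementation: `Label`, `ToolPolicy`, and `decide` are hypothetical names, the two boolean flags are a simplified stand-in for integrity and confidentiality labels, and the `dependencies` list is assumed to be whatever a dependency screener reports as relevant to the tool call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical security label attached to a piece of context."""
    untrusted: bool = False      # integrity: content may be attacker-controlled
    confidential: bool = False   # confidentiality: content must not leak

    def join(self, other: "Label") -> "Label":
        # The combined label is at least as restrictive as either input.
        return Label(self.untrusted or other.untrusted,
                     self.confidential or other.confidential)


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical per-tool policy."""
    requires_trusted_input: bool  # e.g. a tool that moves money
    output_is_public: bool        # e.g. a tool that posts to a public site


def decide(policy: ToolPolicy, dependencies: list[Label]) -> str:
    """Return "execute" if the call is safe, otherwise "confirm".

    `dependencies` holds the labels of the context regions the call
    depends on (assumed to come from a dependency screener).
    """
    label = Label()
    for dep in dependencies:
        label = label.join(dep)
    if policy.requires_trusted_input and label.untrusted:
        return "confirm"
    if policy.output_is_public and label.confidential:
        return "confirm"
    return "execute"


user_prompt = Label()                     # trusted, public
tool_response = Label(untrusted=True)     # may carry an injected prompt
send_money = ToolPolicy(requires_trusted_input=True, output_is_public=False)

print(decide(send_money, [user_prompt]))
print(decide(send_money, [user_prompt, tool_response]))
```

In this sketch, a call that depends only on the user's prompt runs automatically, while a call that also depends on an untrusted tool response falls back to user confirmation. Dropping irrelevant regions from `dependencies` is what keeps the number of confirmations low.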

Paper Structure

This paper contains 28 sections, 11 equations, 11 figures, 4 tables, 2 algorithms.

Figures (11)

  • Figure 1: An example prompt injection in TBAS. Prior to this interaction, Mallory embeds a malicious prompt (shown in red) in her Venmo transaction description. The LM calls the get_recent_transaction tool to respond to the user's request, and the tool returns Mallory's prompt as part of its response. The LM reacts to the prompt and sends Mallory $100.
  • Figure 2: Attention score distribution for (non-)dependent data for two models. The attention scores are obtained by the open-source OPT-125m model. The results indicate that attention scores are effective at capturing dependency for the LM.
  • Figure 3: Attention score distribution of User Prompt and Tool Response for instruction following. The injected instructions are embedded in the tools' response. When prompt injection happens, the attention density shifts to the tool's response.
  • Figure 4: Illustration of Tool-based Agent Systems and their security risks.
  • Figure 5: End-to-end evaluation of the security-utility trade-off for prompt injection. The top right corner indicates a high success rate on the user's task and high integrity of the defense against prompt injection across test cases.
  • ...and 6 more figures