
A Framework for Formalizing LLM Agent Security

Vincent Siu, Jingxuan He, Kyle Montgomery, Zhun Wang, Neil Gong, Chenguang Wang, Dawn Song

Abstract

Security in LLM agents is inherently contextual. For example, the same action taken by an agent may represent legitimate behavior or a security violation depending on whose instruction led to the action, what objective is being pursued, and whether the action serves that objective. However, existing definitions of security attacks against LLM agents often fail to capture this contextual nature. As a result, defenses face a fundamental utility-security tradeoff: applying defenses uniformly across all contexts can lead to significant utility loss, while applying defenses in insufficient or inappropriate contexts can result in security vulnerabilities. In this work, we present a framework that systematizes existing attacks and defenses from the perspective of contextual security. To this end, we propose four security properties that capture contextual security for LLM agents: task alignment (pursuing authorized objectives), action alignment (individual actions serving those objectives), source authorization (executing commands from authenticated sources), and data isolation (ensuring information flows respect privilege boundaries). We further introduce a set of oracle functions that enable verification of whether these security properties are violated as an agent executes a user task. Using this framework, we reformalize existing attacks, such as indirect prompt injection, direct prompt injection, jailbreak, task drift, and memory poisoning, as violations of one or more security properties, thereby providing precise and contextual definitions of these attacks. Similarly, we reformalize defenses as mechanisms that strengthen oracle functions or perform security property checks. Finally, we discuss several important future research directions enabled by our framework.
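To make the idea of contextual checking concrete, here is a minimal sketch of how the four properties could be evaluated for a single agent step. It assumes a hypothetical `Action` record (tool, target, originating source, privilege labels of the data it uses) and a hypothetical `Context` holding stand-ins for the oracle functions. The toy oracles below use exact string matching and a fixed allow-list. They are illustrative assumptions, not the paper's oracle definitions.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Action:
    """One tool call proposed by the agent (hypothetical representation)."""
    tool: str                                   # e.g. "delete_file"
    target: str                                 # e.g. "build/"
    source: str                                 # principal whose instruction led to this action
    data_labels: FrozenSet[str] = frozenset()   # privilege labels of data the action consumes


@dataclass(frozen=True)
class Context:
    """Execution context the oracles are evaluated against (hypothetical)."""
    user_task: str
    authorized_sources: FrozenSet[str]
    allowed_labels: FrozenSet[str]
    # Stand-ins for oracle functions; the real oracles are not specified here.
    task_oracle: Callable[[str, str], bool]        # (objective, user_task) -> authorized?
    action_oracle: Callable[[Action, str], bool]   # (action, objective) -> serves objective?


def violated_properties(ctx: Context, objective: str, action: Action) -> List[str]:
    """Return the names of the four security properties this step violates."""
    violations = []
    if not ctx.task_oracle(objective, ctx.user_task):
        violations.append("task_alignment")
    if not ctx.action_oracle(action, objective):
        violations.append("action_alignment")
    if action.source not in ctx.authorized_sources:
        violations.append("source_authorization")
    if not action.data_labels <= ctx.allowed_labels:
        violations.append("data_isolation")
    return violations


# Toy oracles for illustration only: exact objective match and a fixed allow-list
# of (tool, target) pairs that serve each objective.
SERVES = {"clean up build artifacts": {("delete_file", "build/")}}


def make_ctx(user_task: str) -> Context:
    return Context(
        user_task=user_task,
        authorized_sources=frozenset({"user"}),
        allowed_labels=frozenset({"public"}),
        task_oracle=lambda objective, task: objective == task,
        action_oracle=lambda a, objective: (a.tool, a.target) in SERVES.get(objective, set()),
    )


# The same deletion, first requested by the user, then injected by a web page.
delete = Action("delete_file", "build/", source="user")
injected = Action("delete_file", "build/", source="web_page")

print(violated_properties(make_ctx("clean up build artifacts"), "clean up build artifacts", delete))
# prints []
print(violated_properties(make_ctx("summarize the report"), "clean up build artifacts", injected))
# prints ['task_alignment', 'source_authorization']
```

The identical `delete_file` call passes every check when it comes from the user and serves the user's task, but is flagged when an untrusted source steers the agent toward an unrequested objective. In this toy run the injected case trips task alignment and source authorization; how the paper maps each attack onto specific properties is given by its own formalization, not by this sketch.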

This paper contains 29 sections, 3 equations, 3 figures.

Figures (3)

  • Figure 1: Contextual security resolves the utility–security trade-off in LLM-agent systems. These examples illustrate that the same prompt may correspond to either an attack or a legitimate task depending on the execution context. In the absence of contextual considerations in attack definitions, prior detection methods primarily classify prompts as malicious or benign based on surface patterns, without accounting for how the agent is executing the task. This inevitably forces a trade-off: blocking patterns such as “delete file” improves security but prevents legitimate cleanup operations, while allowing them preserves utility but enables data-destruction attacks. Our framework formalizes context by checking four contextual security properties: Source Authorization, Task Alignment, Action Alignment, and Data Isolation. This enables the agent to distinguish legitimate file cleanup from data destruction even when the underlying prompts are identical, thereby achieving both security and utility.
  • Figure 2: Graphical representation of an agent solving an example task, showing how the memory $M_t$, environment $E$, and trajectory $Tr$ evolve over time.
  • Figure 3: Data isolation violations occur when information inappropriately crosses session boundaries. (A) An agent observes admin credentials during debugging on Day 1, then reuses those credentials for an unrelated access check on Day 5, violating memory constraints. (B) Current benchmarks reset context between tasks, making cross-session credential reuse invisible to evaluation. Data isolation checks require tracking information flow across the trajectory.
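The Figure 3 scenario suggests that data isolation checks need provenance that survives across sessions. Below is a minimal sketch of such tracking under simplifying assumptions: a hypothetical `ProvenanceTracker` records the session in which each value was first observed and whether it is privileged, and it flags a violation when a privileged value is reused in a different session. Treating "different session" as the privilege boundary is an assumption made for illustration; the framework's data isolation property is defined over privilege boundaries in general. The credential string is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Provenance:
    session: str      # session (task) in which the value was first observed
    privileged: bool  # whether the value is privileged, e.g. a credential


@dataclass
class ProvenanceTracker:
    """Hypothetical information-flow log kept across the whole trajectory."""
    origins: Dict[str, Provenance] = field(default_factory=dict)

    def observe(self, value: str, session: str, privileged: bool) -> None:
        # Record only the first observation; later sightings keep the original origin.
        self.origins.setdefault(value, Provenance(session, privileged))

    def violates_isolation(self, value: str, current_session: str) -> bool:
        """True if a privileged value is used outside the session it came from."""
        origin: Optional[Provenance] = self.origins.get(value)
        return origin is not None and origin.privileged and origin.session != current_session


# Mirrors Figure 3(A): a credential seen while debugging on Day 1 is reused on Day 5.
tracker = ProvenanceTracker()
tracker.observe("ADMIN_CREDENTIAL_PLACEHOLDER", session="day1-debugging", privileged=True)

print(tracker.violates_isolation("ADMIN_CREDENTIAL_PLACEHOLDER", "day1-debugging"))     # False
print(tracker.violates_isolation("ADMIN_CREDENTIAL_PLACEHOLDER", "day5-access-check"))  # True
```

Because the tracker persists across tasks instead of being reset, it can surface the cross-session reuse that, per Figure 3(B), stays invisible to benchmarks that clear context between tasks.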