Table of Contents
Fetching ...

LLM Agents Should Employ Security Principles

Kaiyuan Zhang, Zian Su, Pin-Yu Chen, Elisa Bertino, Xiangyu Zhang, Ninghui Li

TL;DR

This paper addresses the security and privacy vulnerabilities inherent in LLM-based agents by advocating the explicit application of classic information-security principles. It introduces AgentSandbox, a framework that enforces defense-in-depth, least privilege, complete mediation, and psychological acceptability through a multi-component architecture (Persistent Agent, Data Minimizer, Ephemeral Agent, I/O Firewall, Response Filter) and a reward-modeling policy engine. The authors provide an illustrative travel-agent example and present an empirical evaluation against multiple defenses using the AgentDojo benchmark, showing that AgentSandbox achieves strong utility while substantially reducing attack success rates. The work argues that embedding these foundational security principles into LLM agent protocols is essential for trustworthy, privacy-preserving, and regulator-aligned agent ecosystems.

Abstract

Large Language Model (LLM) agents show considerable promise for automating complex tasks using contextual reasoning; however, interactions involving multiple agents and the system's susceptibility to prompt injection and other forms of context manipulation introduce new vulnerabilities related to privacy leakage and system exploitation. This position paper argues that the well-established design principles in information security, which are commonly referred to as security principles, should be employed when deploying LLM agents at scale. Design principles such as defense-in-depth, least privilege, complete mediation, and psychological acceptability have helped guide the design of mechanisms for securing information systems over the last five decades, and we argue that their explicit and conscientious adoption will help secure agentic systems. To illustrate this approach, we introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle. We evaluate with state-of-the-art LLMs along three dimensions: benign utility, attack utility, and attack success rate. AgentSandbox maintains high utility for its intended functions under both benign and adversarial evaluations while substantially mitigating privacy risks. By embedding secure design principles as foundational elements within emerging LLM agent protocols, we aim to promote trustworthy agent ecosystems aligned with user privacy expectations and evolving regulatory requirements.

LLM Agents Should Employ Security Principles

TL;DR

This paper addresses the security and privacy vulnerabilities inherent in LLM-based agents by advocating the explicit application of classic information-security principles. It introduces AgentSandbox, a framework that enforces defense-in-depth, least privilege, complete mediation, and psychological acceptability through a multi-component architecture (Persistent Agent, Data Minimizer, Ephemeral Agent, I/O Firewall, Response Filter) and a reward-modeling policy engine. The authors provide an illustrative travel-agent example and present an empirical evaluation against multiple defenses using the AgentDojo benchmark, showing that AgentSandbox achieves strong utility while substantially reducing attack success rates. The work argues that embedding these foundational security principles into LLM agent protocols is essential for trustworthy, privacy-preserving, and regulator-aligned agent ecosystems.

Abstract

Large Language Model (LLM) agents show considerable promise for automating complex tasks using contextual reasoning; however, interactions involving multiple agents and the system's susceptibility to prompt injection and other forms of context manipulation introduce new vulnerabilities related to privacy leakage and system exploitation. This position paper argues that the well-established design principles in information security, which are commonly referred to as security principles, should be employed when deploying LLM agents at scale. Design principles such as defense-in-depth, least privilege, complete mediation, and psychological acceptability have helped guide the design of mechanisms for securing information systems over the last five decades, and we argue that their explicit and conscientious adoption will help secure agentic systems. To illustrate this approach, we introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle. We evaluate with state-of-the-art LLMs along three dimensions: benign utility, attack utility, and attack success rate. AgentSandbox maintains high utility for its intended functions under both benign and adversarial evaluations while substantially mitigating privacy risks. By embedding secure design principles as foundational elements within emerging LLM agent protocols, we aim to promote trustworthy agent ecosystems aligned with user privacy expectations and evolving regulatory requirements.

Paper Structure

This paper contains 20 sections, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the AgentSandbox framework, illustrating its operational workflow. A User's task prompt is processed by the Persistent Agent (PA), which, after context retrieval, forwards it to the Data Minimizer (DM). This module supplies a minimized data subset to a dedicated Ephemeral Agent (EA). The EA then engages external services, with these interactions mediated and validated by the I/O Firewall. The Response Filter (RF) subsequently processes responses before they are returned to the PA for result consolidation and delivery to the User.
  • Figure 2: Illustrative example comparing travel agent risks.
  • Figure 3: Evaluation of various defenses under different task suites on gpt-4o-mini-2024-07-18.
  • Figure 4: Evaluation of various defenses under different task suites on o3-mini-2025-01-31.