AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection

Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, Chaowei Xiao

TL;DR

AGrail introduces a lifelong guardrail for LLM agents to adaptively detect both task-specific and systemic risks. It employs a cooperative two-LLM workflow (Analyzer and Executor) to generate, validate, and store adaptive safety checks, aided by a memory module that reinforces effective policies over time. Evaluations on Safe-OS and multiple downstream datasets show AGrail achieves strong safety performance, with low attack success rates and robust generalization across tasks and domains, while preserving normal task effectiveness. The Safe-OS benchmark provides realistic OS-environment risk evaluation, highlighting AGrail's practical impact for safer real-world LLM-powered agents.
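The generate, validate, and store loop described above can be pictured with a minimal sketch. It is not AGrail's implementation: the two LLMs are replaced by plain functions, and every name here (`Memory`, `analyzer`, `executor`, `guard`, the `no_system_path` check) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; this is not AGrail's actual code or API.
Check = Callable[[str], bool]  # returns True when the action passes the check


@dataclass
class Memory:
    """Stores validated check names per action type for later reuse."""
    checks: Dict[str, List[str]] = field(default_factory=dict)

    def recall(self, action_type: str) -> List[str]:
        return list(self.checks.get(action_type, []))

    def store(self, action_type: str, names: List[str]) -> None:
        self.checks[action_type] = list(names)


def analyzer(action_type: str, recalled: List[str]) -> List[str]:
    """Stand-in for the Analyzer LLM: propose checks, reusing recalled ones."""
    proposed = list(recalled)
    if action_type == "delete" and "no_system_path" not in proposed:
        proposed.append("no_system_path")
    return proposed


def executor(action: str, proposed: List[str],
             tools: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Stand-in for the Executor LLM: keep checks backed by a tool, run them."""
    valid = [name for name in proposed if name in tools]
    safe = all(tools[name](action) for name in valid)
    return safe, valid


def guard(action_type: str, action: str, memory: Memory,
          tools: Dict[str, Check]) -> bool:
    """Run one guardrail pass before the agent's action is executed."""
    recalled = memory.recall(action_type)
    proposed = analyzer(action_type, recalled)
    safe, valid = executor(action, proposed, tools)
    memory.store(action_type, valid)  # validated checks persist across calls
    return safe
```

Under these assumptions, calling `guard("delete", "rm -rf /etc/passwd", Memory(), {"no_system_path": lambda a: "/etc" not in a})` blocks the action, and the validated check stays in memory for the next request of the same action type.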

Abstract

The rapid advancements in Large Language Models (LLMs) have enabled their deployment as autonomous agents for handling complex tasks in dynamic environments. These LLMs demonstrate strong problem-solving capabilities and adaptability to multifaceted scenarios. However, their use as agents also introduces significant risks. These include task-specific risks, which are identified by the agent administrator based on the specific task requirements and constraints, and systemic risks, which stem from vulnerabilities in the agents' design or interactions and can compromise the confidentiality, integrity, or availability (CIA) of information and trigger security risks. Existing defenses fail to adaptively and effectively mitigate these risks. In this paper, we propose AGrail, a lifelong agent guardrail to enhance LLM agent safety, which features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility. Extensive experiments demonstrate that AGrail not only achieves strong performance against task-specific and systemic risks but also exhibits transferability across different LLM agents' tasks.

Paper Structure

This paper contains 60 sections, 3 equations, 37 figures, 10 tables, 2 algorithms.

Figures (37)

  • Figure 1: Risk on Computer-use Agents. Our framework can defend against systemic and task-specific risks and prevent them before agent actions are executed in the environment.
  • Figure 2: Workflow of AGrail. When the OS agent moves a file as requested, it may accidentally overwrite an existing file in the target path. Our framework, guided by safety criteria, prevents this by generating and performing safety checks that invoke the corresponding tool to verify whether the file already exists, ensuring the action does not cause damage.
  • Figure 3: Performance Comparison across Different Scenarios. AGrail not only maintains a low ASR but also, compared with baselines, more effectively defends against the correct risks, i.e. those matching the ground truth.
  • Figure 4: Cosine Similarity between Memory $m$ and Ground Truth $\Omega^{*}$ across Three Seeds on Mind2Web-SC with GPT-4o.
  • Figure 5: Cosine Similarity of TF-IDF Representations of Memory across Three Seeds on Mind2Web-SC with GPT-4o.
  • ...and 32 more figures
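
The file-move scenario in the Figure 2 caption can be illustrated with a small sketch of a pre-execution check. This is an assumption-laden example, not the tool AGrail invokes: the function names `target_exists` and `guarded_move` are hypothetical, and the check covers only the overwrite case the caption describes.

```python
import os
import shutil


def target_exists(src: str, dst: str) -> bool:
    """Hypothetical tool: True if moving src to dst would overwrite something."""
    # If dst is a directory, the file would land inside it under its own name.
    if os.path.isdir(dst):
        final = os.path.join(dst, os.path.basename(src))
    else:
        final = dst
    return os.path.exists(final)


def guarded_move(src: str, dst: str) -> bool:
    """Move src to dst only if nothing would be overwritten; report success."""
    if target_exists(src, dst):
        return False  # block the action before it reaches the environment
    shutil.move(src, dst)
    return True
```

The check runs before `shutil.move`, mirroring the caption's point that the risk is caught before the action is executed rather than repaired afterwards.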