GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

Zhen Xiang, Linzhi Zheng, Yanjie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, Bo Li

TL;DR

GuardAgent is a guardrail agent that protects target LLM agents by translating safety guard requests into executable guardrail code via knowledge-enabled reasoning. The framework runs a two-stage pipeline, task plan generation followed by guardrail code generation and execution, supported by a memory module of demonstrations and an extendable toolbox of callable functions. On two benchmarks, EICU-AC for healthcare access control and Mind2Web-SC for web safety policies, GuardAgent reaches guardrail accuracies above 98% label prediction accuracy (LPA) on EICU-AC and above 83% on Mind2Web-SC, with no degradation of target task performance. It outperforms baselines across multiple core LLMs, and it is non-invasive, reliable, and training-free, which suggests practical potential for safeguarding diverse AI agents.

Abstract

The rapid advancement of large language model (LLM) agents has raised new concerns regarding their safety and security. In this paper, we propose GuardAgent, the first guardrail agent to protect target agents by dynamically checking whether their actions satisfy given safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution. By executing the code, GuardAgent can deterministically follow the safety guard request and safeguard target agents. In both steps, an LLM is utilized as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module storing experiences from previous tasks. In addition, we propose two novel benchmarks: the EICU-AC benchmark to assess access control for healthcare agents and the Mind2Web-SC benchmark to evaluate safety policies for web agents. We show that GuardAgent effectively moderates violating actions for different types of agents on these two benchmarks, with guardrail accuracies above 98% and 83%, respectively. Project page: https://guardagent.github.io/
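
The retrieval of in-context demonstrations from the memory module can be pictured as a top-k similarity lookup over stored experiences. The following is a minimal sketch under stated assumptions: the `Demo` record, the `retrieve_demos` helper, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Demo:
    """One stored experience: the task text plus the plan and code produced for it."""
    task: str
    plan: str
    code: str


def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; an assumed stand-in for whatever measure the memory module uses.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve_demos(memory: list[Demo], query: str, k: int = 1) -> list[Demo]:
    # Rank stored demonstrations by similarity to the new task and keep the top k.
    ranked = sorted(memory, key=lambda d: similarity(d.task, query), reverse=True)
    return ranked[:k]


# Hypothetical memory contents, for illustration only.
memory = [
    Demo("access control for a healthcare agent", "plan A", "code A"),
    Demo("safety policy for a web shopping agent", "plan B", "code B"),
]
top = retrieve_demos(memory, "access control for a hospital database agent", k=1)
```

The retrieved demonstrations would then be placed in the LLM prompt for both the plan-generation and the code-generation step; how the prompt is assembled is not specified here.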

Paper Structure

This paper contains 48 sections, 1 equation, 22 figures, 8 tables.

Figures (22)

  • Figure 1: Illustration of GuardAgent safeguarding other target agents on diverse tasks. Given a) a set of safety guard requests informed by a specification of the target agent and b) the input and output logs recording the target agent's action trajectories, GuardAgent first generates an action plan based on the experiences retrieved from the memory. Then, guardrail code is generated from the action plan and a list of callable functions. Actions of the target agent that violate the safety guard requests are denied by GuardAgent.
  • Figure 2: A toy example of GuardAgent executing a safety guard request for access control on a healthcare target agent (EHRAgent). A general administration user requests the lab results of a patient. However, based on the safety guard request, this user type cannot access the 'lab' database. GuardAgent detects this rule violation by analyzing the safety guard requests and the action proposed by the target agent via guardrail code generation and execution.
  • Figure 3: A case study comparing GuardAgent with the Model-Guarding-Agent baseline. For a query by a nurse (without access to the 'diagnosis' database) that requires access to both the 'medication' and 'diagnosis' databases (bolded), the baseline approach 'considerately' added the 'diagnosis' database to the accessible list for nursing, leading to an incorrect grant of access. GuardAgent, however, strictly follows the safety guard requests to generate guardrail code, which avoids such 'autonomy-driven' mistakes.
  • Figure 4: Breakdown of GuardAgent results over the three roles in EICU-AC and the six rules in Mind2Web-SC, with Llama3.3-70B (top row) and GPT-4 (bottom row) as the core LLM. GuardAgent performs uniformly well for all roles and rules except rule 5, which concerns movies, music, and videos, due to the broader scenario coverage of that safety rule.
  • Figure 5: Performance of GuardAgent (with GPT-4 as the core LLM) provided with different numbers of demonstrations on EICU-AC and Mind2Web-SC.
  • ...and 17 more figures
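
The access-control cases described in Figures 2 and 3 can be illustrated with a small deterministic check of the kind that guardrail code might perform. This is a sketch under stated assumptions, not the paper's generated code: the `ACCESS` table and the `check_access` function are hypothetical, and only two facts are taken from the captions (general administration cannot access 'lab', and nursing cannot access 'diagnosis').

```python
# Hypothetical role-to-database permissions. Only two facts come from the
# figure captions: general administration cannot read 'lab', and nursing
# cannot read 'diagnosis'. Every other entry is made up for illustration.
ACCESS = {
    "general administration": {"patient", "cost"},
    "nursing": {"patient", "medication", "lab"},
}


def check_access(role: str, required: set[str]) -> tuple[bool, set[str]]:
    """Return (allowed, denied_databases) for one proposed agent action."""
    permitted = ACCESS.get(role, set())  # unknown roles get no access
    denied = required - permitted        # plain set difference, no LLM judgment
    return (not denied, denied)


# The Figure 3 scenario: a nurse's query needs 'medication' and 'diagnosis'.
allowed, denied = check_access("nursing", {"medication", "diagnosis"})
print(allowed, sorted(denied))  # False ['diagnosis']
```

Because the decision is a set difference computed by executed code, it cannot quietly extend the permitted list the way the baseline in Figure 3 did.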