AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration

Jizhou Chen, Samuel Lee Cong

TL;DR

This work addresses the safety of tool-using LLM agents by proposing AgentGuard, a framework that uses the agent's own orchestrator as a safety evaluator. It automates four phases (unsafe workflow identification, unsafe workflow validation, safety constraint generation, and safety constraint validation) to discover risky tool-use sequences and to validate sandbox-based constraints before deployment. The contributions include a concrete architecture with three components (the orchestrator, a Safety Constraint Expert, and a Prompting Proxy), a deliverable evaluation report, and initial empirical validation that indicates feasibility while exposing challenges with LLM-generated constraints. The approach aims to provide baseline safety guarantees, reusable threat intelligence, and benchmarks for hardening agent tool orchestration across diverse domains, with ongoing work to broaden the evaluation and improve robustness.

Abstract

The integration of tool use into large language models (LLMs) enables agentic systems with real-world impact. At the same time, unlike standalone LLMs, compromised agents can execute malicious workflows with more consequential impact because of their tool-use capability. We propose AgentGuard, a framework that autonomously discovers and validates unsafe tool-use workflows and then generates safety constraints to confine agent behavior, achieving a baseline safety guarantee at deployment. AgentGuard leverages the LLM orchestrator's innate capabilities (knowledge of tool functionalities, scalable and realistic workflow generation, and tool execution privileges) so that the orchestrator acts as its own safety evaluator. The framework operates through four phases: identifying unsafe workflows, validating them in real-world execution, generating safety constraints, and validating constraint efficacy. The output, an evaluation report with unsafe workflows, test cases, and validated constraints, enables multiple security applications. We demonstrate AgentGuard's feasibility empirically. With this exploratory work, we hope to inspire the establishment of standardized testing and hardening procedures for LLM agents to enhance their trustworthiness in real-world applications.
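As an illustration of what such an evaluation report might hold, the following is a minimal Python sketch. The abstract only states that the report contains unsafe workflows, test cases, and validated constraints; every class name, field name, and the JSON layout below are assumptions made for this example and are not taken from the paper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class UnsafeWorkflow:
    """One unsafe tool-use sequence (hypothetical schema)."""
    task_scenario: str     # scenario the workflow was found in
    tool_calls: list[str]  # ordered tool names forming the workflow
    test_case: str         # input used to reproduce the workflow
    validated: bool = False  # True once reproduced in real execution


@dataclass
class SafetyConstraint:
    """One generated constraint (hypothetical schema)."""
    workflow_index: int    # index into EvaluationReport.unsafe_workflows
    rule: str              # human-readable or machine-readable rule text
    validated: bool = False  # True once shown to block the workflow


@dataclass
class EvaluationReport:
    """Aggregated results (hypothetical schema)."""
    unsafe_workflows: list[UnsafeWorkflow] = field(default_factory=list)
    constraints: list[SafetyConstraint] = field(default_factory=list)

    def validated_constraints(self) -> list[SafetyConstraint]:
        # Only constraints that passed validation are kept for deployment.
        return [c for c in self.constraints if c.validated]

    def to_json(self) -> str:
        # Serialize the whole report so other tools can consume it.
        return json.dumps(asdict(self), indent=2)
```

Keeping validated and unvalidated entries in the same structure, with a flag on each, is one possible design choice: it lets a consumer of the report see which candidate constraints failed validation as well as which passed.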

Paper Structure

This paper contains 21 sections, 1 figure.

Figures (1)

  • Figure 1: Overview of AgentGuard. AgentGuard has three key components: 1) The LLM-based orchestrator within the target agent under evaluation, 2) A Safety Constraint Expert Agent responsible for safety constraint generation, and 3) A centralized Prompting Proxy Agent to instruct the other two components to perform testing and hardening. AgentGuard works in four main phases: 1) Unsafe Workflow Identification, 2) Unsafe Workflow Validation, 3) Safety Constraint Generation, and 4) Safety Constraint Validation. The deliverable of AgentGuard is an evaluation report containing aggregated evaluation results corresponding to different task scenarios.
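The four phases named in the caption can be sketched as a simple control loop. This is a hedged illustration only: the function `run_four_phases`, its callback parameters, the stub implementations, and the dictionary-based constraint format are all assumptions made for the example. In the described system, the Prompting Proxy Agent would drive the orchestrator and the Safety Constraint Expert Agent through LLM prompts rather than plain Python callbacks.

```python
from typing import Callable

Workflow = list[str]         # ordered tool names (assumed representation)
Constraint = dict[str, str]  # e.g. {"deny_tool": "send_email"} (assumed format)


def run_four_phases(
    scenario: str,
    identify: Callable[[str], list[Workflow]],
    validate_workflow: Callable[[Workflow], bool],
    generate_constraint: Callable[[Workflow], Constraint],
    validate_constraint: Callable[[Workflow, Constraint], bool],
) -> list[dict]:
    """Run the four phases for one task scenario and collect findings."""
    findings = []
    for workflow in identify(scenario):            # Phase 1: identification
        if not validate_workflow(workflow):        # Phase 2: workflow validation
            continue                               # not reproducible, so skip it
        constraint = generate_constraint(workflow) # Phase 3: constraint generation
        effective = validate_constraint(workflow, constraint)  # Phase 4
        findings.append({
            "scenario": scenario,
            "workflow": workflow,
            "constraint": constraint,
            "constraint_effective": effective,
        })
    return findings


# Stub components standing in for the LLM-driven agents (all hypothetical).
def stub_identify(scenario: str) -> list[Workflow]:
    return [["read_file", "send_email"], ["list_dir"]]


def stub_validate_workflow(workflow: Workflow) -> bool:
    # Pretend only workflows that send data out were reproduced as unsafe.
    return "send_email" in workflow


def stub_generate_constraint(workflow: Workflow) -> Constraint:
    # Deny the final tool call of the unsafe sequence.
    return {"deny_tool": workflow[-1]}


def stub_validate_constraint(workflow: Workflow, constraint: Constraint) -> bool:
    # The constraint is effective here if it blocks a tool the workflow needs.
    return constraint["deny_tool"] in workflow


results = run_four_phases(
    "file assistant",
    stub_identify,
    stub_validate_workflow,
    stub_generate_constraint,
    stub_validate_constraint,
)
print(results)
```

Passing the components in as callbacks keeps the loop independent of any particular agent framework, so the stubs could be replaced by real orchestrator and expert-agent calls without changing the control flow.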