AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration
Jizhou Chen, Samuel Lee Cong
TL;DR
This work addresses the safety of tool-using LLM agents by proposing AgentGuard, a framework that uses the agent's own orchestrator as a safety evaluator. It automates four phases (unsafe workflow identification, unsafe workflow validation, safety constraint generation, and safety constraint validation) to discover risky tool-use sequences and establish validated sandbox-based constraints before deployment. The contributions include a concrete architecture with three components (orchestrator, Safety Constraint Expert, and Prompting Proxy), an evaluation report as the deliverable, and initial empirical validation showing feasibility despite the challenges of LLM-generated constraints. The approach aims to enable baseline safety guarantees, reusable threat intelligence, and benchmarks for hardening agent tool orchestration across diverse domains; work to broaden the evaluation and improve robustness is ongoing.
Abstract
The integration of tool use into large language models (LLMs) enables agentic systems with real-world impact. However, unlike standalone LLMs, compromised agents can execute malicious workflows whose consequences are amplified by their tool-use capability. We propose AgentGuard, a framework that autonomously discovers and validates unsafe tool-use workflows and then generates safety constraints to confine agent behavior, providing a baseline safety guarantee at deployment. AgentGuard leverages the LLM orchestrator's innate capabilities (knowledge of tool functionalities, scalable and realistic workflow generation, and tool execution privileges) to act as its own safety evaluator. The framework operates in four phases: identifying unsafe workflows, validating them in real-world execution, generating safety constraints, and validating constraint efficacy. The output, an evaluation report containing unsafe workflows, test cases, and validated constraints, enables multiple security applications. We demonstrate AgentGuard's feasibility empirically. With this exploratory work, we hope to inspire the establishment of standardized testing and hardening procedures for LLM agents, enhancing their trustworthiness in real-world applications.
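
As a rough illustration of how the four phases might chain together, the sketch below wires them into a single loop. It is a simplified sketch under stated assumptions, not AgentGuard's implementation: the names `Workflow`, `Report`, `evaluate`, and the four callables are hypothetical placeholders for the orchestrator-driven steps, and the report omits the test cases mentioned above for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Workflow:
    """Hypothetical stand-in for a tool-use workflow: an ordered list of tool names."""
    steps: List[str]


@dataclass
class Report:
    """Hypothetical, simplified evaluation report."""
    unsafe_workflows: List[Workflow] = field(default_factory=list)
    validated_constraints: List[str] = field(default_factory=list)


def evaluate(
    identify: Callable[[], List[Workflow]],
    is_unsafe_when_run: Callable[[Workflow], bool],
    propose_constraint: Callable[[Workflow], str],
    blocks: Callable[[str, Workflow], bool],
) -> Report:
    """Run the four phases in order; each callable is an assumed placeholder."""
    report = Report()
    for wf in identify():                    # Phase 1: identify candidate unsafe workflows
        if not is_unsafe_when_run(wf):       # Phase 2: validate by actually executing
            continue
        report.unsafe_workflows.append(wf)
        constraint = propose_constraint(wf)  # Phase 3: generate a safety constraint
        if blocks(constraint, wf):           # Phase 4: check the constraint stops the workflow
            report.validated_constraints.append(constraint)
    return report
```

In this sketch, a workflow is recorded as unsafe only once execution confirms it, and a constraint is kept only if it demonstrably blocks the workflow that motivated it. Constraints that fail the fourth check are simply dropped here; a fuller design might retry generation instead.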
