Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing, Manish Nagireddy, Inkit Padhi, Greta Dolcetti, Zhangchen Xu, Subhajit Chaudhury, Ambrish Rawat, Liubov Nedoshivina, Pin-Yu Chen, Prasanna Sattigeri, Xiangliang Zhang

TL;DR

The paper tackles the safety of LLM-based agentic systems at the pre-execution planning stage by addressing data, model, and evaluation gaps. It introduces AuraGen to synthesize controlled risk trajectories, Safiron as a guardian trained via supervised fine-tuning and GRPO reinforcement learning, and Pre-Exec Bench to assess planning-time safety across diverse scenarios. Empirical results show Safiron outperforms baselines in detection, risk categorization, and explanation quality, with ablations clarifying the impact of data composition and RL training. This work offers a scalable, adaptable blueprint for safer, more reliable agentic systems across high-stakes applications.

Abstract

While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing guardrails mostly operate post-execution, which is difficult to scale and leaves little room for controllable supervision at the plan level. To address this challenge, we highlight three critical gaps in current research: a data gap, a model gap, and an evaluation gap. To close the data gap, we introduce AuraGen, a controllable engine that (i) synthesizes benign trajectories, (ii) injects category-labeled risks with calibrated difficulty, and (iii) filters outputs via an automated reward model, producing large and reliable corpora for pre-execution safety. To close the guardian model gap, we propose Safiron, a foundational guardrail that combines a cross-planner adapter with a compact guardian model. The adapter unifies different input formats, while Safiron flags risky cases, assigns risk types, and generates rationales; trained in two stages with a broadly explored data recipe, Safiron achieves robust transfer across settings. To close the evaluation gap, we release Pre-Exec Bench, a realistic benchmark covering diverse tools and branching trajectories, which measures detection, fine-grained categorization, explanation, and cross-planner generalization in human-verified scenarios. Extensive experiments demonstrate consistent gains of the proposed guardrail over strong baselines on Pre-Exec Bench, and ablations further distill actionable practices, providing a practical template for safer agentic systems.
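
The three AuraGen stages named in the abstract (synthesize, inject, filter) can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the `Trajectory` type, the function names, the risk category strings, and the toy scoring function are all hypothetical stand-ins, and the real system presumably uses LLM-based generation and a learned reward model in their place.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """A planned sequence of agent steps (hypothetical structure)."""
    steps: List[str]
    risk_category: Optional[str] = None  # None marks a benign trajectory


def synthesize_benign(task: str) -> Trajectory:
    # Stage (i): placeholder for a benign plan generator.
    return Trajectory(steps=[f"plan: {task}", "call tool", "report result"])


def inject_risk(traj: Trajectory, category: str, rng: random.Random) -> Trajectory:
    # Stage (ii): tag one randomly chosen step with a labeled risk.
    steps = list(traj.steps)
    idx = rng.randrange(len(steps))
    steps[idx] = f"[{category}] {steps[idx]}"
    return Trajectory(steps=steps, risk_category=category)


def filter_by_reward(
    trajs: List[Trajectory],
    score: Callable[[Trajectory], float],
    threshold: float,
) -> List[Trajectory]:
    # Stage (iii): keep only samples the scoring function rates highly enough.
    return [t for t in trajs if score(t) >= threshold]


def build_corpus(
    tasks: List[str],
    categories: List[str],
    score: Callable[[Trajectory], float],
    threshold: float = 0.5,
    seed: int = 0,
) -> List[Trajectory]:
    rng = random.Random(seed)
    candidates: List[Trajectory] = []
    for task in tasks:
        benign = synthesize_benign(task)
        candidates.append(benign)
        candidates.append(inject_risk(benign, rng.choice(categories), rng))
    return filter_by_reward(candidates, score, threshold)


def toy_score(traj: Trajectory) -> float:
    # Toy stand-in for a reward model: accept any three-step trajectory.
    return 1.0 if len(traj.steps) == 3 else 0.0
```

Calling `build_corpus(["book a flight", "send an email"], ["privacy_leak", "unsafe_action"], toy_score)` yields one benign and one risk-labeled trajectory per task. Keeping the three stages as separate functions mirrors the abstract's description and makes each one replaceable on its own.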

Paper Structure

This paper contains 29 sections, 7 equations, 23 figures, 7 tables.

Figures (23)

  • Figure 1: Workflow of AuraGen and the four risk injection strategies it employs.
  • Figure 2: Deployment pipeline of the proposed guardrail framework.
  • Figure 3: The training pipeline of Safiron.
  • Figure 4: Left: Construction steps of Pre-Exec Bench. Right: Risk type distribution. The benchmark consists of 1,001 harmless and 671 risky samples (with injected risks).
  • Figure 5: Performance under different ratios of hard/easy samples during GRPO training.
  • ...and 18 more figures