DynaGuard: A Dynamic Guardian Model With User-Defined Policies

Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein

TL;DR

DynaGuard is a dynamic guardian framework that enforces user-defined guardrails at runtime, addressing the limitations of static safety taxonomies. The models are accompanied by DynaBench, a large-scale, diverse dataset of 40,000 unique policies paired with multi-turn dialogues, and they provide rapid policy-compliance judgments with optional reasoning traces for interpretability. Training combines supervised fine-tuning (SFT) and policy-guided reinforcement learning (GRPO) on a 50/50 mix of dynamic-policy and safety data, achieving state-of-the-art results on both traditional safety tasks and unseen, complex policies while maintaining fast inference. The work demonstrates cross-model generalization, interpretable explanations for policy violations, and practical routes for on-premises deployment, pointing toward scalable and customizable guardrails in real-world LLM deployments.

Abstract

Guardian models play a crucial role in ensuring the safety and ethical behavior of user-facing AI applications by enforcing guardrails and detecting harmful content. While standard guardian models are limited to predefined, static harm categories, we introduce DynaGuard, a suite of dynamic guardian models offering novel flexibility by evaluating text based on user-defined policies, and DynaBench, a dataset for training and evaluating dynamic guardian models. Our models provide both rapid detection of policy violations and a chain-of-thought reasoning option that articulates and justifies model outputs. Critically, DynaGuard not only surpasses static models in detection accuracy on traditional safety categories, but is also competitive with frontier reasoning models on free-form policy violations, all in a fraction of the time. This makes DynaGuard a critical tool for language model guardrails.

Paper Structure

This paper contains 39 sections, 4 equations, 16 figures, 17 tables.

Figures (16)

  • Figure 1: We introduce guardian models that enforce arbitrary policies at runtime. When the guardian model (indicated by the shield) is coupled with a language model assistant, it can protect against undesired or harmful outputs. Additionally, our model provides detailed explanations when a policy is violated, enabling the chat model to recover and correct its policy-violating behavior.
  • Figure 2: Pipeline for synthesizing the DynaBench training set. Diversity is seeded into the dataset samples through large banks of static attributes and rules. For the agent persona in each dialogue, we use LLMs to develop rich backgrounds on the company/use case associated with the agent. The policy is also provided to the LLM to generate a relevant dialogue.
  • Figure 3: Failure case analysis on DynaBench. The left and center columns show model accuracy on subsets of the benchmark where particular attributes are isolated. Top left: the number of rules in each sample's policy, with Qwen3's accuracy decreasing from single-rule policies to policies with more than 40 rules. Bottom left: the length of the dialogue, measured in turns. Top center: the combined length of the dialogue and policy, measured in tokens. Bottom center: the number of logical hops present in samples (see \ref{app:policy_dist}). Top right: accuracy on subsets of the benchmark grouped by the failure mode that each sample highlights, described in detail in \ref{app:failure_modes}. Bottom right: the same analysis broken down by the category of business impact that each sample highlights.
  • Figure 4: Distribution of failure modes highlighted in the test set. Each sample in the benchmark is annotated with one primary failure mode.
  • Figure 5: Distribution of business impacts the test set samples relate to. Each sample is annotated with a single business impact.
  • ...and 11 more figures
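
The guard-and-recover loop described in the Figure 1 caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Verdict`, `guarded_reply`, `toy_guardian`, and `toy_assistant` are hypothetical, the guardian is a keyword stub rather than a DynaGuard model, and the retry limit and fallback message are arbitrary choices that do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Outcome of one guardian check (hypothetical structure, not from the paper)."""
    violated: bool
    explanation: str = ""


# Hypothetical interfaces: a guardian maps (policy, dialogue) to a Verdict;
# an assistant maps (dialogue, optional guardian feedback) to a reply.
Guardian = Callable[[str, List[str]], Verdict]
Assistant = Callable[[List[str], Optional[str]], str]


def guarded_reply(policy: str, dialogue: List[str], assistant: Assistant,
                  guardian: Guardian, max_retries: int = 2) -> str:
    """Generate a reply, check it against the policy, and retry with the
    guardian's explanation as feedback when a violation is flagged."""
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        reply = assistant(dialogue, feedback)
        verdict = guardian(policy, dialogue + [reply])
        if not verdict.violated:
            return reply
        # Pass the explanation back so the assistant can correct itself.
        feedback = verdict.explanation
    # Assumed fallback when every attempt violates the policy.
    return "I'm sorry, I can't help with that."


def toy_guardian(policy: str, dialogue: List[str]) -> Verdict:
    """Stub guardian: flags the latest turn if it mentions a refund.
    A real guardian model would judge the dialogue against the policy text."""
    if "refund" in dialogue[-1].lower():
        return Verdict(True, "The reply promises a refund, which the policy forbids.")
    return Verdict(False)


def toy_assistant(dialogue: List[str], feedback: Optional[str]) -> str:
    """Stub assistant: changes its answer once it receives feedback."""
    if feedback is None:
        return "Sure, I can issue a refund."
    return "Let me connect you with a human agent."


reply = guarded_reply(
    policy="Never promise refunds.",
    dialogue=["User: I want my money back."],
    assistant=toy_assistant,
    guardian=toy_guardian,
)
print(reply)
```

In this sketch the policy is passed to the guardian as plain text at call time, which mirrors the runtime, user-defined framing in the abstract; swapping in a different policy requires no change to the loop itself.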