DynaGuard: A Dynamic Guardian Model With User-Defined Policies
Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein
TL;DR
DynaGuard is a dynamic guardian framework that enforces user-defined guardrails supplied at runtime, addressing the limitations of static safety taxonomies. It is paired with DynaBench, a large-scale, diverse dataset of 40,000 unique policies paired with multi-turn dialogues, and it supports both rapid policy-compliance judgments and optional reasoning traces for interpretability. Training combines supervised fine-tuning (SFT) with GRPO-based reinforcement learning on a 50/50 mix of dynamic-policy and safety data, achieving state-of-the-art results on both traditional safety tasks and unseen, complex policies while maintaining fast inference. The work demonstrates cross-model generalization, interpretable explanations for policy violations, and practical routes for on-premises deployment, making it a candidate for scalable, customizable guardrails in real-world LLM deployments.
Abstract
Guardian models play a crucial role in ensuring the safety and ethical behavior of user-facing AI applications by enforcing guardrails and detecting harmful content. While standard guardian models are limited to predefined, static harm categories, we introduce DynaGuard, a suite of dynamic guardian models offering novel flexibility by evaluating text based on user-defined policies, and DynaBench, a dataset for training and evaluating dynamic guardian models. Our models provide both rapid detection of policy violations and a chain-of-thought reasoning option that articulates and justifies model outputs. Critically, DynaGuard not only surpasses static models in detection accuracy on traditional safety categories, but is also competitive with frontier reasoning models on free-form policy violations, all in a fraction of the time. This makes DynaGuard a critical tool for language model guardrails.
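
To make the idea of evaluating text against user-defined policies concrete, here is a minimal sketch of how a policy and a multi-turn dialogue might be packed into one guardian prompt, with an optional reasoning mode, and how a verdict might be read back. The prompt layout, the `GuardianInput` type, the `build_prompt` and `parse_verdict` helpers, and the `VERDICT: PASS` / `VERDICT: FAIL` convention are all hypothetical illustrations and are not taken from DynaGuard itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GuardianInput:
    # Hypothetical container: rules are supplied by the user at runtime.
    policy_rules: List[str]
    # Each turn is a (speaker, utterance) pair.
    dialogue: List[Tuple[str, str]]


def build_prompt(item: GuardianInput, with_reasoning: bool = False) -> str:
    """Assemble an illustrative guardian prompt (assumed format)."""
    rules = "\n".join(f"{i}. {r}" for i, r in enumerate(item.policy_rules, 1))
    turns = "\n".join(f"{speaker}: {text}" for speaker, text in item.dialogue)
    mode = (
        "Explain your reasoning, then give the verdict."
        if with_reasoning
        else "Give only the verdict."
    )
    return (
        f"Policy:\n{rules}\n\nDialogue:\n{turns}\n\n"
        f"{mode} End with 'VERDICT: PASS' or 'VERDICT: FAIL'."
    )


def parse_verdict(model_output: str) -> bool:
    """Return True if the output reports a policy violation (FAIL)."""
    # Scan from the end so any reasoning text before the verdict is skipped.
    for line in reversed(model_output.strip().splitlines()):
        line = line.strip().upper()
        if line.startswith("VERDICT:"):
            label = line.split(":", 1)[1].strip()
            if label in ("PASS", "FAIL"):
                return label == "FAIL"
    raise ValueError("no verdict found in model output")


example = GuardianInput(
    policy_rules=["Never offer refunds.", "Do not discuss competitors."],
    dialogue=[("User", "Can I get my money back?"),
              ("Agent", "Sure, I will refund you today.")],
)
prompt = build_prompt(example, with_reasoning=True)
```

The prompt string would be sent to whichever guardian model is in use; the fast mode and the reasoning mode differ only in the instruction line, so the same parser handles both outputs.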
