Executable Governance for AI: Translating Policies into Rules Using LLMs
Gautam Varma Datla, Anudeep Vurity, Tejaswani Dash, Tazeem Ahmad, Mohd Adnan, Saima Rafi
TL;DR
This paper addresses the challenge of turning policy prose into actionable, verifiable checks by introducing Policy-to-Tests (P2T), a pipeline paired with a compact DSL that produces machine-readable rules with provenance. The approach combines deterministic processing, LLM-based extraction and repair, and optional SMT validation to generate executable rules across multiple governance sources (EU AI Act, NIST RMF, HIPAA, MS RAI). It demonstrates strong agreement with human gold standards and shows downstream safety benefits through a HIPAA-based guardrail case study, while providing open-source code and resources for reproducible evaluation. The work offers a practical, auditable bridge from governance principles to enforceable tests suitable for integration into enforcement and evaluation systems, and it identifies interactive validation and richer semantic modeling as future directions.
Abstract
AI policy guidance is predominantly written as prose, which practitioners must first convert into executable rules before frameworks can evaluate or enforce them. This manual step is slow, error-prone, difficult to scale, and often delays the use of safeguards in real-world deployments. To address this gap, we present Policy-to-Tests (P2T), a framework that converts natural-language policy documents into normalized, machine-readable rules. The framework comprises a pipeline and a compact domain-specific language (DSL) that encodes hazards, scope, conditions, exceptions, and required evidence, yielding a canonical representation of extracted rules. To test the framework beyond a single policy, we apply it across general frameworks, sector guidance, and enterprise standards, extracting obligation-bearing clauses and converting them into executable rules. These AI-generated rules closely match strong human baselines on span-level and rule-level metrics, with robust inter-annotator agreement on the gold set. To evaluate downstream behavioral and safety impact, we add HIPAA-derived safeguards to a generative agent and compare it with an otherwise identical agent without guardrails. An LLM-based judge, aligned with gold-standard criteria, measures violation rates and robustness to obfuscated and compositional prompts. Detailed results are provided in the appendix. We release the codebase, DSL, prompts, and rule sets as open-source resources to enable reproducible evaluation.
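To make the idea of a normalized rule concrete, the following is a minimal sketch of how a record carrying the fields named above (hazard, scope, conditions, exceptions, required evidence, and provenance) might be represented and checked. It is not the released P2T DSL: the `Rule` class, the `check` function, the field layout, the four outcome labels, and the example rule are all assumptions made for illustration, and the example rule is a simplified, hypothetical safeguard rather than a quotation from HIPAA.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical normalized rule; field names are illustrative only."""
    rule_id: str
    hazard: str                      # what the rule guards against
    scope: Dict[str, Any]            # who or what the rule applies to
    conditions: Dict[str, Any]       # when the obligation is triggered
    exceptions: List[Dict[str, Any]] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)  # required artifacts
    provenance: str = ""             # pointer back to the source clause


def _matches(pattern: Dict[str, Any], context: Dict[str, Any]) -> bool:
    # A pattern matches when every key it names has the same value in context.
    return all(context.get(key) == value for key, value in pattern.items())


def check(rule: Rule, context: Dict[str, Any]) -> str:
    """Return 'not_applicable', 'exempt', 'violation', or 'compliant'."""
    if not (_matches(rule.scope, context) and _matches(rule.conditions, context)):
        return "not_applicable"
    if any(_matches(exc, context) for exc in rule.exceptions):
        return "exempt"
    provided = context.get("evidence", [])
    missing = [item for item in rule.evidence if item not in provided]
    return "violation" if missing else "compliant"


# Hypothetical example rule (assumed for illustration, not taken from HIPAA text).
rule = Rule(
    rule_id="example-001",
    hazard="phi_disclosure",
    scope={"actor": "agent"},
    conditions={"action": "disclose", "data_type": "phi"},
    exceptions=[{"purpose": "treatment"}],
    evidence=["patient_authorization"],
    provenance="hypothetical source clause",
)

base = {
    "actor": "agent",
    "action": "disclose",
    "data_type": "phi",
    "purpose": "marketing",
    "evidence": [],
}

print(check(rule, base))                                   # violation
print(check(rule, {**base, "purpose": "treatment"}))       # exempt
print(check(rule, {**base, "evidence": ["patient_authorization"]}))  # compliant
```

Keeping scope, conditions, and exceptions as separate fields lets a checker distinguish a rule that does not apply from one that applies but is exempted, which is the kind of distinction an audit trail would need to record.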
