Executable Governance for AI: Translating Policies into Rules Using LLMs

Gautam Varma Datla, Anudeep Vurity, Tejaswani Dash, Tazeem Ahmad, Mohd Adnan, Saima Rafi

TL;DR

This paper addresses the challenge of turning policy prose into actionable, verifiable checks. It introduces Policy-to-Tests (P2T), a pipeline paired with a compact DSL that produces machine-readable rules with provenance. The approach combines deterministic processing, LLM-based extraction and repair, and optional SMT validation to generate executable rules from multiple governance sources (EU AI Act, NIST RMF, HIPAA, MS RAI). The extracted rules agree closely with human gold standards, and a HIPAA-based guardrail case study shows downstream safety benefits. Code and resources are released as open source for reproducible evaluation. The work offers a practical, auditable bridge from governance principles to enforceable tests that can be integrated into enforcement and evaluation systems; the authors identify interactive validation and richer semantic modeling as future directions.

Abstract

AI policy guidance is predominantly written as prose, which practitioners must first convert into executable rules before frameworks can evaluate or enforce them. This manual step is slow, error-prone, difficult to scale, and often delays the use of safeguards in real-world deployments. To address this gap, we present Policy-to-Tests (P2T), a framework that converts natural-language policy documents into normalized, machine-readable rules. The framework comprises a pipeline and a compact domain-specific language (DSL) that encodes hazards, scope, conditions, exceptions, and required evidence, yielding a canonical representation of extracted rules. To test the framework beyond a single policy, we apply it across general frameworks, sector guidance, and enterprise standards, extracting obligation-bearing clauses and converting them into executable rules. These AI-generated rules closely match strong human baselines on span-level and rule-level metrics, with robust inter-annotator agreement on the gold set. To evaluate downstream behavioral and safety impact, we add HIPAA-derived safeguards to a generative agent and compare it with an otherwise identical agent without guardrails. An LLM-based judge, aligned with gold-standard criteria, measures violation rates and robustness to obfuscated and compositional prompts. Detailed results are provided in the appendix. We release the codebase, DSL, prompts, and rule sets as open-source resources to enable reproducible evaluation.
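
To make the idea of a normalized rule concrete, the following is a minimal sketch of how a record with the fields named in the abstract (hazard, scope, conditions, exceptions, required evidence) might be represented and checked. It is not the paper's DSL: the `Rule` class, the `evaluate` function, the outcome labels, and the example rule are all hypothetical, and the example is a simplified illustration rather than an accurate encoding of HIPAA.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types: a context is a flat dict describing one agent action.
Context = Dict[str, object]
Predicate = Callable[[Context], bool]


@dataclass
class Rule:
    """Illustrative rule record; field names follow the abstract, not the real DSL."""
    rule_id: str
    hazard: str
    scope: Predicate                      # does the rule apply to this context at all?
    conditions: List[Predicate]           # triggers that activate the obligation
    exceptions: List[Predicate] = field(default_factory=list)
    required_evidence: List[str] = field(default_factory=list)
    source_span: str = ""                 # provenance: the clause the rule came from


def evaluate(rule: Rule, ctx: Context) -> str:
    """Return one of: not_applicable, exempt, violation, compliant (assumed labels)."""
    if not rule.scope(ctx):
        return "not_applicable"
    if any(exc(ctx) for exc in rule.exceptions):
        return "exempt"
    if not all(cond(ctx) for cond in rule.conditions):
        return "not_applicable"
    provided = ctx.get("evidence", [])
    missing = [e for e in rule.required_evidence if e not in provided]
    return "violation" if missing else "compliant"


# Simplified, made-up example rule for illustration only.
EXAMPLE_RULE = Rule(
    rule_id="example-001",
    hazard="unauthorized_disclosure",
    scope=lambda c: c.get("data_type") == "phi",
    conditions=[lambda c: c.get("action") == "disclose"],
    exceptions=[lambda c: c.get("purpose") == "treatment"],
    required_evidence=["patient_authorization"],
    source_span="(placeholder for the quoted policy clause)",
)

if __name__ == "__main__":
    request = {"data_type": "phi", "action": "disclose",
               "purpose": "marketing", "evidence": []}
    print(evaluate(EXAMPLE_RULE, request))
```

Keeping exceptions and required evidence as separate fields, rather than folding them into one condition, lets a checker report why a rule did or did not fire, which is the kind of auditability the abstract emphasizes.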

Paper Structure

This paper contains 13 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: P2T overview. The pipeline reads policy documents and returns executable atomic rules. It does so by iteratively extracting and refining rules with LLMs and deterministic checks, including clause mining, evidence gating, and SMT (Satisfiability Modulo Theories) validation.