Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs

Oren Rachmil, Roy Betser, Itay Gershon, Omer Hofman, Nitay Yakoby, Yuval Meron, Idan Yankelev, Asaf Shabtai, Yuval Elovici, Roman Vainshtein

TL;DR

The paper tackles enterprise policy compliance for LLMs by reframing policy violations as out-of-distribution detection in the model’s activation space. It introduces a training-free approach that whitens per-layer activations using statistics derived from a small in-policy set and scores compliance with the Euclidean norm in the whitened space, with the decision threshold calibrated on a mix of in-policy and out-of-policy samples. The method supports white-box and black-box deployments, requires no fine-tuning, and selects an operational layer together with a calibrated threshold. It achieves strong results on the DynaBench benchmark, outperforming LLM-as-a-judge and fine-tuned baselines with minimal latency. Practically, this yields a scalable, interpretable governance tool for continuous policy monitoring and updates in enterprise environments. The work demonstrates the viability of lightweight, category-aware whitening transforms as a robust, deployable mechanism for policy-aware oversight of LLMs.

Abstract

Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance, yielding a near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it lightweight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: https://tinyurl.com/policy-violation-detection
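
The whitening step and the norm-based compliance score described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `fit_whitening` and `compliance_score`, the ZCA-style form of the whitening matrix, the `eps` regularizer, and the synthetic "activations" are all assumptions made for illustration. In practice the inputs would be hidden activations extracted from a chosen LLM layer.

```python
import numpy as np

def fit_whitening(in_policy_acts, eps=1e-5):
    """Estimate the mean and a whitening matrix from in-policy activations.

    in_policy_acts: array of shape (n_samples, hidden_dim).
    The ZCA form W = V diag(1/sqrt(lambda + eps)) V^T is an assumption;
    the source only states that a linear whitening transform is used.
    """
    mu = in_policy_acts.mean(axis=0)
    centered = in_policy_acts - mu
    cov = centered.T @ centered / max(len(in_policy_acts) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Clip tiny negative eigenvalues from round-off; eps avoids division by zero.
    inv_sqrt = 1.0 / np.sqrt(np.clip(eigvals, 0.0, None) + eps)
    W = eigvecs @ np.diag(inv_sqrt) @ eigvecs.T
    return mu, W

def compliance_score(acts, mu, W):
    """Euclidean norm of the whitened activations (larger = further from in-policy)."""
    return np.linalg.norm((acts - mu) @ W, axis=-1)

# Synthetic stand-in for in-policy activations (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
d = 8
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))       # random rotation to correlate dimensions
scales = np.linspace(0.5, 3.0, d)                  # unequal per-direction variances
in_policy = (rng.normal(size=(2000, d)) * scales) @ Q.T + 5.0

mu, W = fit_whitening(in_policy)
whitened = (in_policy - mu) @ W
in_scores = compliance_score(in_policy, mu, W)
shifted_scores = compliance_score(in_policy + 4.0, mu, W)  # crude out-of-distribution proxy
```

After fitting, the covariance of `whitened` is close to the identity, and the shifted samples receive larger scores on average than the in-policy samples, which is the property the detector relies on.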

Paper Structure

This paper contains 35 sections, 22 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Illustration of the proposed policy-violation detection framework. Organizational policies define expected behavior for an internal LLM. When a user query produces a response, the model’s hidden activations are transformed using a whitening matrix derived from in-policy samples. Compliance is then estimated via the activation norm in this whitened space, and responses whose whitened norms exceed a calibrated governance threshold are flagged as policy violations.
  • Figure 2: Illustration of the offline phase of compliance calibration. In-policy and out-of-policy user-LLM interactions are first collected and passed through the model for activation extraction. Last-token hidden activation vectors are then used for distribution modeling to derive a whitening matrix that normalizes layer activations. Using this matrix, compliance scores are computed for both in-policy and out-of-policy activations, followed by ROC-AUC-based threshold calibration to identify the optimal decision boundary separating compliant and non-compliant interactions.
  • Figure 3: Illustration of the online compliance detection process. During runtime user-LLM interaction, last-token hidden activations are extracted and whitened using the precomputed whitening matrix. The resulting vector is used to compute a compliance score, which is compared against the precomputed calibrated threshold to determine whether the interaction is in-policy (compliant) or out-of-policy (violating).
  • Figure 4: Statistics of LLM activations before and after whitening. Top: Raw activations exhibit arbitrary means/variances and substantial cross-dimensional covariance. Bottom: Whitened activations are approximately zero-mean, unit-variance, with near-identity covariance. Category: content control.
  • Figure 5: Ablation study showing the effect of (top) varying Top-$K$ (with 100 samples per category) and (bottom) varying the number of samples per category (with Top-$K=15$) on F1 score.
  • ...and 7 more figures
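
The threshold calibration of Figure 2 and the runtime check of Figure 3 can be sketched as follows. The captions say only that the threshold is calibrated from the ROC curve; choosing the point that maximizes TPR minus FPR (Youden's J) is an assumption made here, as are the function names `calibrate_threshold` and `flag_violation` and the toy scores.

```python
import numpy as np

def calibrate_threshold(in_scores, out_scores):
    """Pick the score threshold that maximizes TPR - FPR (Youden's J).

    Violations (out-of-policy) are treated as the positive class, and a
    sample is flagged when its score is >= the threshold. The selection
    criterion is an assumption, not taken from the paper.
    """
    in_scores = np.asarray(in_scores, dtype=float)
    out_scores = np.asarray(out_scores, dtype=float)
    candidates = np.unique(np.concatenate([in_scores, out_scores]))
    best_t, best_j = candidates[0], -np.inf
    for t in candidates:
        tpr = np.mean(out_scores >= t)   # violations correctly flagged
        fpr = np.mean(in_scores >= t)    # compliant samples wrongly flagged
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return float(best_t)

def flag_violation(score, threshold):
    """Online check: compare one compliance score against the calibrated threshold."""
    return bool(score >= threshold)

# Toy compliance scores (hypothetical values for illustration only).
in_policy_scores = [1.0, 1.2, 0.9, 1.1]
out_of_policy_scores = [2.5, 3.0, 2.8, 2.2]
threshold = calibrate_threshold(in_policy_scores, out_of_policy_scores)
```

With these toy scores the two groups are perfectly separable, so the chosen threshold is the smallest out-of-policy score. On real data the groups overlap and the threshold trades missed violations against false alarms.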