Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs

Oren Rachmil; Roy Betser; Itay Gershon; Omer Hofman; Nitay Yakoby; Yuval Meron; Idan Yankelev; Asaf Shabtai; Yuval Elovici; Roman Vainshtein

TL;DR

The paper tackles enterprise policy compliance for LLMs by reframing policy violation detection as out-of-distribution detection in the model's activation space. It introduces a training-free approach that whitens per-layer activations using statistics derived from a small in-policy set and scores compliance with the Euclidean norm in the whitened space, calibrated on a mixed policy dataset. The method supports white-box and black-box deployments, requires no fine-tuning, and selects an operational layer together with a calibrated threshold. It achieves strong results on the DynaBench benchmark, outperforming LLM-as-a-judge and fine-tuned baselines with minimal latency. In practice, this yields a scalable, interpretable governance tool for continuous policy monitoring and policy updates in enterprise environments. The work demonstrates that lightweight, category-aware whitening transforms are a viable, deployable mechanism for policy-aware oversight of LLMs.

Abstract

Aligning proprietary large language models (LLMs) with internal organizational policies has become an urgent priority as organizations increasingly deploy LLMs in sensitive domains such as legal support, finance, and medical services. Beyond generic safety filters, enterprises require reliable mechanisms to detect policy violations within their regulatory and operational frameworks, where breaches can trigger legal and reputational risks. Existing content moderation frameworks, such as guardrails, remain largely confined to the safety domain and lack the robustness to capture nuanced organizational policies. LLM-as-a-judge and fine-tuning approaches, though flexible, introduce significant latency and lack interpretability. To address these limitations, we propose a training-free and efficient method that treats policy violation detection as an out-of-distribution (OOD) detection problem. Inspired by whitening techniques, we apply a linear transformation to decorrelate the model's hidden activations and standardize them to zero mean and unit variance, yielding a near-identity covariance matrix. In this transformed space, we use the Euclidean norm as a compliance score to detect policy violations. The method requires only the policy text and a small number of illustrative samples, which makes it lightweight and easily deployable. On a challenging policy benchmark, our approach achieves state-of-the-art results, surpassing both existing guardrails and fine-tuned reasoning models. This work provides organizations with a practical and statistically grounded framework for policy-aware oversight of LLMs, advancing the broader goal of deployable AI governance. Code is available at: https://tinyurl.com/policy-violation-detection
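
The whitening and scoring steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' released code: the function names (`fit_whitening`, `compliance_score`, `calibrate_threshold`), the PCA-style whitening matrix, the `eps` regularizer, and the balanced-accuracy threshold rule are all assumptions made for illustration, and the activations are assumed to be already extracted from one layer as an array of shape `(n_samples, d)`.

```python
import numpy as np


def fit_whitening(in_policy_acts, eps=1e-6):
    """Estimate mean and whitening matrix from in-policy activations.

    in_policy_acts: array of shape (n_samples, d) from a single layer.
    eps is an assumed regularizer that keeps tiny eigenvalues from exploding.
    """
    mu = in_policy_acts.mean(axis=0)
    centered = in_policy_acts - mu
    cov = centered.T @ centered / (in_policy_acts.shape[0] - 1)
    # Symmetric eigendecomposition: cov = V diag(eigvals) V^T
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.maximum(eigvals, 0.0)  # guard against small negative values
    # PCA whitening matrix W = diag(eigvals)^(-1/2) V^T, so W cov W^T ~ I
    W = (eigvecs / np.sqrt(eigvals + eps)).T
    return mu, W


def compliance_score(acts, mu, W):
    """Euclidean norm of the whitened activations (larger = more atypical)."""
    z = (acts - mu) @ W.T
    return np.linalg.norm(z, axis=-1)


def calibrate_threshold(scores, is_violation):
    """Pick the score threshold with the best balanced accuracy.

    Assumed calibration rule for a small labeled set containing both
    compliant and violating samples; other criteria are equally plausible.
    """
    scores = np.asarray(scores, dtype=float)
    is_violation = np.asarray(is_violation, dtype=bool)
    best_t, best_acc = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tpr = pred[is_violation].mean()        # violations flagged
        tnr = (~pred[~is_violation]).mean()    # compliant samples passed
        acc = 0.5 * (tpr + tnr)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```

In this sketch, a new input would be flagged as a violation when `compliance_score(x, mu, W)` is at or above the calibrated threshold. Because the covariance of the whitened in-policy data is close to the identity, the score is equivalent to a (regularized) Mahalanobis distance from the in-policy mean. The covariance estimate is only well conditioned when the number of in-policy samples is comparable to or larger than the dimension `d`; with fewer samples, `eps` or a dimensionality reduction step would matter much more.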
