The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives

Matthieu Bou; Nyal Patel; Arjun Jagota; Satyapriya Krishna; Sonali Parbhoo

TL;DR

This work tackles the opacity of LLM objectives by reframing reward inference as a verification problem with the Alignment Auditor, a Bayesian IRL-based auditing framework. The framework first recovers a posterior over reward functions to quantify ambiguity, then contracts epistemic uncertainty through sequential updates, and then applies uncertainty-aware diagnostics to reveal shortcuts and out-of-distribution (OOD) prompts. Finally, it validates the inferred reward at the policy level by integrating it into RLHF and showing toxicity reductions comparable to a ground-truth oracle. The framework offers auditors and regulators actionable, uncertainty-aware tools to verify what LLMs are truly optimizing and to strengthen alignment guarantees.

Abstract

The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that reframes reward inference from a simple estimation task into a comprehensive verification process. Our framework leverages Bayesian IRL not only to recover a distribution over objectives but also to enable three critical audit capabilities: (i) quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
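
To make the idea of posterior contraction in capability (i) concrete, the following is a minimal toy sketch and not the paper's implementation. It assumes a one-dimensional reward weight, a Bradley-Terry preference likelihood, and a flat prior evaluated on a grid; the function name `posterior_std_by_round` and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np

def posterior_std_by_round(true_w=1.5, rounds=5, pairs_per_round=40, seed=0):
    """Toy sequential Bayesian update over a scalar reward weight w.

    Assumption (illustrative only): reward r(x) = w * phi(x), and preferences
    between two responses follow a Bradley-Terry model on the feature gap.
    Returns the posterior standard deviation of w: first under the prior,
    then after each round of evidence.
    """
    rng = np.random.default_rng(seed)
    grid = np.linspace(-5.0, 5.0, 1001)   # candidate values of w
    log_post = np.zeros_like(grid)        # flat prior over the grid

    def current_std():
        # Normalise in a numerically stable way, then take the std on the grid.
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        mean = (grid * post).sum()
        return float(np.sqrt((((grid - mean) ** 2) * post).sum()))

    stds = [current_std()]                # uncertainty before any evidence
    for _ in range(rounds):
        # Simulated evidence: feature gap phi(a) - phi(b) and observed preference.
        diff = rng.normal(size=pairs_per_round)
        p_prefer_a = 1.0 / (1.0 + np.exp(-true_w * diff))
        prefers_a = rng.random(pairs_per_round) < p_prefer_a

        # Bradley-Terry log-likelihood for every candidate w on the grid.
        logits = np.outer(grid, diff)     # shape: (grid points, pairs)
        log_lik = np.where(
            prefers_a,
            -np.logaddexp(0.0, -logits),  # log sigmoid(logit)
            -np.logaddexp(0.0, logits),   # log (1 - sigmoid(logit))
        )
        log_post += log_lik.sum(axis=1)   # the posterior becomes the next prior
        stds.append(current_std())
    return stds

if __name__ == "__main__":
    for round_idx, std in enumerate(posterior_std_by_round()):
        print(f"round {round_idx}: posterior std of w = {std:.3f}")
```

In this sketch the posterior standard deviation plays the role of the ambiguity measure: it starts at the prior's spread and shrinks as rounds of evidence accumulate. A real audit would replace the scalar weight with a reward model over LLM outputs and the grid with an approximate inference method.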
