The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives

Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo

TL;DR

This work tackles the opacity of LLM objectives by reframing reward inference as a verification problem through The Alignment Auditor, a Bayesian IRL-based auditing framework. It first recovers a posterior over reward functions to quantify ambiguity, then uses sequential updates to contract epistemic uncertainty, followed by uncertainty-aware diagnostics to reveal shortcuts and OOD prompts. Finally, it validates the inferred reward at the policy level by integrating it into RLHF and showing toxicity reductions comparable to a ground-truth oracle. The framework offers auditors and regulators actionable, uncertainty-aware tools to verify what LLMs are truly optimizing and to strengthen alignment guarantees.

Abstract

The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
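The posterior contraction described in capability (i) can be illustrated with a toy model. The sketch below is not the paper's implementation: it assumes a single scalar reward weight `theta`, a Gaussian prior, a Bradley-Terry likelihood over pairwise preferences, and a grid approximation to the posterior. The preference data are simulated, and every name (`sequential_audit`, `posterior_stats`, `true_theta`) is hypothetical. Each round multiplies the current posterior by the likelihood of a new batch of comparisons, and the posterior standard deviation is recorded after every round.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def posterior_stats(grid, log_post):
    """Normalize an unnormalized log-posterior on the grid; return (mean, std)."""
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


def sequential_audit(true_theta=2.0, rounds=5, pairs_per_round=20, seed=0):
    """Toy sequential Bayesian update over a scalar reward weight (assumed setup)."""
    rng = random.Random(seed)
    grid = [-6.0 + 0.05 * i for i in range(241)]  # candidate values of theta
    prior_sd = 2.0  # assumed Gaussian prior N(0, prior_sd^2)
    log_post = [-0.5 * (t / prior_sd) ** 2 for t in grid]
    history = [posterior_stats(grid, log_post)]  # entry 0 is the prior

    for _ in range(rounds):
        for _ in range(pairs_per_round):
            # d: feature difference between the two completions in a pair.
            d = rng.uniform(-1.0, 1.0)
            # Simulated expert prefers the first completion w.p. sigmoid(theta * d).
            p_first = 1.0 / (1.0 + math.exp(-true_theta * d))
            sign = 1.0 if rng.random() < p_first else -1.0
            # Bradley-Terry log-likelihood added to every grid point.
            log_post = [lp + log_sigmoid(sign * t * d)
                        for lp, t in zip(log_post, grid)]
        history.append(posterior_stats(grid, log_post))
    return history


history = sequential_audit()
for round_idx, (mean, sd) in enumerate(history):
    print(f"round {round_idx}: posterior mean={mean:.2f}, sd={sd:.2f}")
```

In this toy setting the posterior standard deviation after the final round is smaller than the prior's, which is the qualitative behaviour the abstract calls posterior contraction. A grid is used only because the parameter is one-dimensional; a realistic reward model would need an approximate inference method instead.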

Paper Structure

This paper contains 14 sections, 8 equations, 9 figures, 2 algorithms.

Figures (9)

  • Figure 1: Overview of the three-stage alignment auditing framework. First, we learn a posterior distribution over rewards to quantify ambiguity in the reward function. Next, we assess the trustworthiness of the reward posterior using uncertainty diagnostics. Finally, we validate the utility of the inferred reward on a policy level by aligning the model to the inferred objective.
  • Figure 2: Analysis of the inferred reward for Llama-3.2-1B. The model is well-calibrated for both pairwise and single-text predictions (a), and the learned reward function shows a clear separation between toxic and non-toxic completions (b).
  • Figure 3: Performance and calibration metrics for our framework across different model scales. Larger models consistently achieve higher pairwise accuracy, single-text accuracy, AUROC, and F1-score, indicating a more faithful recovery of the expert’s preference signal. Pairwise and single-text Expected Calibration Error (ECE) generally decrease with model size, showing that the inferred reward probabilities are also more reliable for larger models.
  • Figure 4: Sequential Bayes analysis for Llama-1B. Across five rounds, the posterior contracts (a), epistemic uncertainty decreases (b), calibration improves (c), and performance metrics increase (d). This demonstrates the framework's ability to systematically reduce ambiguity.
  • Figure 5: Uncertainty-aware diagnostics. A PCA projection (left) shows that inputs with injected spurious features ('marked') have higher local uncertainty. A strong correlation (r = 0.989) exists between reward variance and the Mahalanobis distance from the training pool (middle), confirming that uncertainty increases for out-of-distribution inputs. Policy-level alignment (right) via fine-tuning with the inferred reward after sequential contraction (Rounds 2–5) achieves toxicity reductions comparable to the oracle RLHF curve, validating policy-level utility (mean ± std over 5 runs). In contrast, using the under-identified Round 1 posterior induces reward hacking with unstable training dynamics and worse final toxicity, highlighting the need for posterior contraction before alignment.
  • ...and 4 more figures
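
The Expected Calibration Error (ECE) reported in the captions for Figures 3 and 4 can be computed along the following lines. This is a minimal sketch of one common binned variant for binary predictions, not necessarily the exact definition used in the paper: predicted probabilities are grouped into equal-width bins, and ECE is the bin-size-weighted average gap between the mean predicted probability and the observed positive rate in each bin. The function name, the `n_bins` default, and the toy inputs are assumptions made for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for binary predictions (assumed variant).

    probs:  predicted probabilities of the positive class, each in [0, 1]
    labels: observed outcomes, each 0 or 1
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins; p == 1.0 is clamped into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        confidence = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(frequency - confidence)
    return ece


# Toy inputs (hypothetical): predictions of 0.9 that are right 9 times out of 10
# are well calibrated; predictions of 0.9 that are never right are not.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
miscalibrated = expected_calibration_error([0.9] * 10, [0] * 10)
print(f"well calibrated: {well_calibrated:.3f}, miscalibrated: {miscalibrated:.3f}")
```

A lower value means the predicted probabilities track observed frequencies more closely. The result depends on the number of bins, so comparisons across models or rounds are only meaningful with a fixed binning scheme.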