Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin

Abstract

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that both comprehensively cover the model's behavior and are causal to it. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach uses an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures more than 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.

Paper Structure

This paper contains 41 sections, 1 theorem, 15 equations, 34 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 1.2

(Nemhauser et al., 1978) For any normalized, monotone submodular function $F$, the set $E_G$ obtained by the greedy algorithm achieves at least a constant fraction $\left(1-\frac{1}{e}\right)$ of the objective value obtained by the optimal solution.
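
In the standard form of this result, the guarantee applies when $F$ is maximized under a cardinality constraint $|E| \le k$ and the greedy algorithm adds one element per step. The inequality can be written as follows, where $E^*$ (an optimal set) and $k$ are notation assumed here, since the excerpt does not show the paper's own symbols:

$$
F(E_G) \;\ge\; \left(1 - \frac{1}{e}\right) F(E^*), \qquad E^* \in \arg\max_{|E| \le k} F(E).
$$

Numerically, $1 - \frac{1}{e} \approx 0.632$, so under these assumptions the greedy set recovers at least about 63% of the optimal objective value.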

Figures (34)

  • Figure 1: Schematic of Obj-Disco. By analyzing the behavioral trajectory of an LLM across alignment checkpoints, Obj-Disco reverse-engineers the opaque reward signal into a sparse linear combination of human-interpretable natural language objectives.
  • Figure 2: Obj-Disco Overview and Qualitative Results. (Left) Obj-Disco employs an iterative greedy search to construct the Discovered Interpretable Reward (DIR). A proposer LLM identifies candidates from high-residual samples, which are then verified for interpretability and trend predictability. (Right) Discovered objectives and their weights for open-source reward models (RM), demonstrating that Obj-Disco successfully recovers domain-specific alignment goals (e.g., conciseness for summaries, logic for code).
  • Figure 3: Controlled (Top) and Open-Source Reward Model (Bottom) Results. (Top, L to R): (1) TLDR, PPO, Llama-8B; (2) TLDR, PPO, Qwen-4B; (3) TLDR, GRPO, Llama-8B; (4) TLDR, GRPO, Qwen-4B. (Bottom, Llama-8B): (1) Alpaca, GRPO; (2) HH-RLHF, GRPO; (3) TLDR, GRPO; (4) Sky, GRPO. (6 trials each)
  • Figure 4: Qualitative Comparison of Case Study Discovered Objectives. Only Obj-Disco successfully identified the latent misaligned behavior (in red) implicitly incentivized by the open-source reward model. Baseline methods largely discovered narrow objectives indicative of helpfulness, failing to capture misaligned behavior. Only active objectives (non-zero coefficients) are shown.
  • Figure 5: Importance of Model Trajectory Ablation. Obj-Disco-Static only compares the base model and the final checkpoint, lacking the dynamic trajectory of Obj-Disco. Setting: Multi-Turn Dialogue, GRPO, Llama-8B. Controlled Evaluation. (6 trials)
  • ...and 29 more figures
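
The iterative greedy search described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the candidate objectives have already been scored on every sample (the hypothetical `scores` matrix stands in for the proposer LLM and the verification step), and it uses plain least squares to fit the weights. The names `greedy_decompose`, `max_objectives`, and `min_gain` are invented for illustration.

```python
import numpy as np


def greedy_decompose(scores, reward, max_objectives=3, min_gain=1e-3):
    """Greedily pick columns of `scores` that best explain `reward`.

    scores: (n_samples, n_candidates) array, one column per candidate objective
            (assumed to be precomputed; hypothetical stand-in for LLM scoring).
    reward: (n_samples,) array holding the opaque reward signal.
    Returns (selected column indices, fitted weights, fraction of variance explained).
    """
    n_candidates = scores.shape[1]
    centered = scores - scores.mean(axis=0)
    target = reward - reward.mean()
    total = float(target @ target)  # assumes the reward is not constant
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = np.inf  # constant candidates get zero alignment

    selected, weights = [], np.zeros(0)
    residual, explained = target, 0.0
    used = np.zeros(n_candidates, dtype=bool)

    while len(selected) < min(max_objectives, n_candidates):
        # Pick the unused candidate most aligned with the current residual.
        alignment = np.abs(centered.T @ residual) / norms
        alignment[used] = -np.inf
        best = int(np.argmax(alignment))

        # Refit all weights jointly on the enlarged set of objectives.
        trial = selected + [best]
        w, *_ = np.linalg.lstsq(centered[:, trial], target, rcond=None)
        new_residual = target - centered[:, trial] @ w
        new_explained = 1.0 - float(new_residual @ new_residual) / total

        # Stop once an extra objective no longer adds meaningful explanation.
        if new_explained - explained < min_gain:
            break
        selected, weights = trial, w
        residual, explained = new_residual, new_explained
        used[best] = True

    return selected, weights, explained


# Synthetic usage: the reward depends on candidates 1 and 4 only.
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 6))
reward = 2.0 * scores[:, 1] - 1.0 * scores[:, 4] + 0.05 * rng.normal(size=200)
selected, weights, explained = greedy_decompose(scores, reward)
print(selected, weights, explained)
```

The stopping rule on marginal explained variance is what keeps the combination sparse in this sketch; the paper's actual validation criteria (interpretability and trend predictability) are not modeled here.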

Theorems & Definitions (2)

  • Definition 1.1: Submodular
  • Theorem 1.2
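
To make the two entries above concrete, here is a small, hedged illustration using set coverage, a textbook example of a normalized, monotone submodular function. It assumes the usual diminishing-returns definition of submodularity (the excerpt lists Definition 1.1 by title only) and a cardinality budget `k`; all function names are hypothetical and unrelated to the paper's code. The sketch checks submodularity by brute force and compares the greedy set against the exhaustive optimum.

```python
import math
from itertools import combinations


def coverage(sets, chosen):
    """F(E): number of distinct elements covered by the chosen sets."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)


def marginal_gain(sets, chosen, e):
    """Gain in coverage from adding set index `e` to `chosen`."""
    return coverage(sets, list(chosen) + [e]) - coverage(sets, chosen)


def is_submodular(sets):
    """Brute-force diminishing-returns check: for A subset of B and e not in B,
    gain(A, e) >= gain(B, e). Only feasible for tiny ground sets."""
    items = range(len(sets))
    subsets = [c for r in range(len(sets) + 1) for c in combinations(items, r)]
    for b in subsets:
        for a in subsets:
            if not set(a) <= set(b):
                continue
            for e in items:
                if e in b:
                    continue
                if marginal_gain(sets, a, e) < marginal_gain(sets, b, e):
                    return False
    return True


def greedy_max_coverage(sets, k):
    """Add, k times, the set with the largest coverage after inclusion."""
    chosen = []
    for _ in range(k):
        remaining = [i for i in range(len(sets)) if i not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda i: coverage(sets, chosen + [i]))
        chosen.append(best)
    return chosen


def optimal_max_coverage(sets, k):
    """Exhaustive search over all size-k selections (exponential cost)."""
    return max(combinations(range(len(sets)), k), key=lambda c: coverage(sets, c))


sets = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}]
greedy = greedy_max_coverage(sets, 2)
optimal = optimal_max_coverage(sets, 2)
print(coverage(sets, greedy), coverage(sets, optimal))  # prints "5 6"
print(coverage(sets, greedy) >= (1 - 1 / math.e) * coverage(sets, optimal))
```

In this instance greedy commits to the largest set first and ends with coverage 5, while the optimum of 6 uses the two smaller sets; 5 still clears the bound of roughly 0.632 × 6 ≈ 3.79.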