When Are Two RLHF Objectives the Same?

Madhava Gaikwad

TL;DR

Opal introduces a canonicalization framework to determine when two RLHF preference objectives are decision-theoretically equivalent by reducing margins to a unique normal form $\text{NF}(L) = \text{Add}[\Phi] \circ \text{Reweight}[s(x)] \circ \text{Link}[g]$ with $\sum_y \Phi(y)=0$. It shows many existing objectives (e.g., DPO and SPPO) collapse to the same canonical form, while GRPO is provably irreducible due to batch-dependent margins; a concrete witness demonstrates this distinction. The framework identifies four orthogonal axes—group normalization, pair-dependent weighting, token-level margins, and trajectory-level objectives—that generate genuinely new objectives beyond reparameterizations, and provides formal guarantees, witnesses, and worked examples. Practically, Opal equips researchers with certificates of equivalence or non-equivalence, clarifying when methodological diversity reflects true novelty versus re-expression of a common target.

Abstract

The preference optimization literature contains many proposed objectives, often presented as distinct improvements. We introduce Opal, a canonicalization algorithm that determines whether two preference objectives are algebraically equivalent by producing either a canonical form or a concrete witness of non-equivalence. Applying Opal reveals that many widely used methods optimize the same underlying objective, while others are provably distinct. For example, batch normalization can cause the same response pair to receive different gradients depending on batch composition. We identify a small set of structural mechanisms that give rise to genuinely different objectives; most remaining differences are reparameterizations.
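The batch-dependence claim can be illustrated with a minimal sketch. It assumes a GRPO-style group normalization in which each reward is standardized by the mean and standard deviation of its batch; the function names and reward values below are hypothetical and are not taken from the paper.

```python
import statistics


def group_normalized_advantages(rewards):
    """Standardize rewards within one batch (assumed GRPO-style normalization)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; assumes rewards are not all equal
    return [(r - mean) / std for r in rewards]


def pair_margin(rewards, i, j):
    """Margin between responses i and j after normalizing over the whole batch."""
    adv = group_normalized_advantages(rewards)
    return adv[i] - adv[j]


# The same response pair (rewards 1.0 and 0.0) sits at indices 0 and 1 in both batches.
# Only the other batch members differ (hypothetical values).
batch_a = [1.0, 0.0, 0.5, 0.5]
batch_b = [1.0, 0.0, 10.0, -10.0]

margin_a = pair_margin(batch_a, 0, 1)
margin_b = pair_margin(batch_b, 0, 1)

print(f"margin in batch A: {margin_a:.4f}")
print(f"margin in batch B: {margin_b:.4f}")
```

Because the margin equals the raw reward difference divided by the batch standard deviation, the identical pair receives a much larger margin in the low-variance batch than in the high-variance one. A margin that depends only on the pair itself could not behave this way.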

Paper Structure

This paper contains 24 sections, 11 theorems, 28 equations, 1 figure, 4 tables.

Key Result

Theorem 3

Every reducible method has a unique canonical form $\text{NF}(L) = \text{Add}[\Phi] \circ \text{Reweight}[s(x)] \circ \text{Link}[g]$, where $\Phi$ satisfies the centering condition $\sum_y \Phi(y) = 0$.
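
The centering condition can be read as a normalization that removes a redundant degree of freedom. The sketch below assumes, hypothetically, that $\text{Add}[\Phi]$ shifts the margin of a pair $(y_w, y_l)$ by the difference $\Phi(y_w) - \Phi(y_l)$ over a finite response set $\mathcal{Y}$. This summary does not define the operation, so treat the sketch as an illustration, not as the paper's proof.

$$
\begin{aligned}
\Phi'(y) = \Phi(y) + c \;&\Longrightarrow\; \Phi'(y_w) - \Phi'(y_l) = \Phi(y_w) - \Phi(y_l), \\
\sum_{y \in \mathcal{Y}} \Phi'(y) = 0 \;&\Longleftrightarrow\; c = -\frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \Phi(y).
\end{aligned}
$$

Under that assumption, every constant shift of $\Phi$ induces the same margins, and the centering condition selects exactly one representative from each such family, which is what a unique offset term requires.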

Figures (1)

  • Figure 1: Opal either outputs a canonical form (proving equivalence) or a witness (proving non-equivalence). DPO, SPPO, and Nash-MD reduce to the same canonical hash. GRPO fails condition (R2) and outputs a witness showing batch-dependent margins.

Theorems & Definitions (16)

  • Definition 1: Margin operations
  • Example 1: DPO as a composition
  • Definition 2: Reducible
  • Example 2: Methods that fail (R2)
  • Theorem 3: Canonical form
  • Definition 4: Witnesses
  • Proposition 5: GRPO witness
  • Lemma 6: Add composition
  • Lemma 7: Reweight composition
  • Lemma 8: Add-Reweight commutation
  • ...and 6 more