When Are Two RLHF Objectives the Same?
Madhava Gaikwad
TL;DR
Opal is a canonicalization framework for deciding when two RLHF preference objectives are decision-theoretically equivalent. It reduces each objective's margin to a unique normal form $\text{NF}(L) = \text{Add}[\Phi] \circ \text{Reweight}[s(x)] \circ \text{Link}[g]$ with $\sum_y \Phi(y)=0$. Under this reduction, many existing objectives (e.g., DPO and SPPO) collapse to the same canonical form, while GRPO is provably irreducible because its margins depend on the batch; a concrete witness demonstrates the distinction. The framework identifies four orthogonal axes (group normalization, pair-dependent weighting, token-level margins, and trajectory-level objectives) that generate genuinely new objectives rather than reparameterizations, and it provides formal guarantees, witnesses, and worked examples. Practically, Opal gives researchers certificates of equivalence or non-equivalence, clarifying when methodological diversity reflects true novelty and when it is a re-expression of a common target.
Abstract
The preference optimization literature contains many proposed objectives, often presented as distinct improvements. We introduce Opal, a canonicalization algorithm that determines whether two preference objectives are algebraically equivalent by producing either a canonical form or a concrete witness of non-equivalence. Applying Opal reveals that many widely used methods optimize the same underlying objective, while others are provably distinct. For example, normalizing over the batch (as GRPO does) can cause the same response pair to receive different gradients depending on batch composition. We identify a small set of structural mechanisms that give rise to genuinely different objectives; most remaining differences are reparameterizations.
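The batch-dependence claim can be illustrated with a minimal sketch. It is not Opal and not the paper's witness: it assumes a GRPO-style normalization in which each reward is centered and scaled by the mean and population standard deviation of its group, and the function names, the `eps` constant, and the reward values are all hypothetical. The sketch places the same two responses (rewards 1.0 and 0.0) in two batches that differ only in a third response and compares the resulting margins.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Center and scale rewards by the group's mean and population std.

    Assumption: this mirrors GRPO-style group normalization; the exact
    estimator (population vs. sample std, value of eps) is illustrative.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pair_margin(rewards, i, j):
    """Margin between responses i and j after group normalization."""
    adv = group_normalized_advantages(rewards)
    return adv[i] - adv[j]


# The same response pair (rewards 1.0 and 0.0) in two hypothetical batches
# that differ only in the third response's reward.
batch_a = [1.0, 0.0, 0.5]
batch_b = [1.0, 0.0, 10.0]

raw_margin = batch_a[0] - batch_a[1]      # identical in both batches
margin_a = pair_margin(batch_a, 0, 1)     # scaled by 1 / std(batch_a)
margin_b = pair_margin(batch_b, 0, 1)     # scaled by 1 / std(batch_b)

print(f"raw margin:              {raw_margin:.3f}")
print(f"normalized margin (a):   {margin_a:.3f}")
print(f"normalized margin (b):   {margin_b:.3f}")
```

The mean cancels in the difference, so the pair's margin is rescaled by the inverse of the group standard deviation. Because that scale depends on the other responses in the batch, it cannot be written as a fixed function of the prompt and the pair alone, which is the kind of batch dependence the abstract describes.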
