Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization

Audrey Huang, Wenhao Zhan, Tengyang Xie, Jason D. Lee, Wen Sun, Akshay Krishnamurthy, Dylan J. Foster

TL;DR

This work tackles overoptimization in offline language-model alignment by showing that KL-regularization is insufficient and introducing χ^2-Preference Optimization (χPO), a one-line modification of Direct Preference Optimization (DPO) that implements pessimism via mixed χ^2–KL regularization. χPO is provably robust to overoptimization, with a sample complexity bound that scales with single-policy concentrability, so it remains data-efficient when offline coverage is partial. The authors extend the approach to general preference models through an iterative self-play framework, and provide theoretical analysis and empirical evidence on the TL;DR summarization task showing improved stability and performance relative to DPO across various β and training regimes. The work also offers insights into the bias-overoptimization tradeoff and shows how χ^2-regularization yields stronger uncertainty quantification than KL alone, suggesting broad applicability in offline reinforcement learning and language-model alignment beyond Bradley–Terry settings.

Abstract

Language model alignment methods such as reinforcement learning from human feedback (RLHF) have led to impressive advances in language model capabilities, but are limited by a widely observed phenomenon known as overoptimization, where the quality of the language model degrades over the course of the alignment process. As the model optimizes performance with respect to an offline reward model, it overfits to inaccuracies and drifts away from preferred responses covered by the data. To discourage such distribution shift, KL-regularization is widely employed in existing offline alignment methods, but overoptimization continues to harm performance. Lending theoretical insight into the source of these empirical observations, we first show that the KL-regularization is too weak to prevent overfitting, then raise the following question: is it possible to design an efficient algorithm that is provably robust to overoptimization? We address this question with a new algorithm for offline alignment, $χ^2$-Preference Optimization ($χ$PO). $χ$PO is a one-line change to Direct Preference Optimization (DPO; Rafailov et al., 2023), which only involves modifying the logarithmic link function in the DPO objective. Despite this minimal change, $χ$PO implicitly implements the principle of pessimism in the face of uncertainty via regularization with the $χ^2$-divergence -- which quantifies uncertainty more effectively than KL-regularization -- and provably alleviates overoptimization, achieving sample-complexity guarantees based on single-policy concentrability -- the gold standard in offline reinforcement learning. $χ$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm that is provably robust to overoptimization.
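As a rough illustration of the "one-line change" described in the abstract, the sketch below computes a per-pair preference loss under both link functions. It assumes the standard DPO form (negative log-sigmoid of a β-scaled margin between the chosen and rejected responses) and takes the link functions $\phi_{\mathsf{DPO}}(z)=\log z$ and $\phi_{\chi\mathsf{PO}}(z)=z+\log z$ from Figure 1. The function names and argument layout are hypothetical, and any further details of the full algorithm are omitted.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def link_dpo(log_ratio):
    # phi_DPO(z) = log z, where z = pi(a|x) / pi_ref(a|x) is passed as log z.
    return log_ratio


def link_chipo(log_ratio):
    # phi_chiPO(z) = z + log z: the only line that differs from DPO.
    return math.exp(log_ratio) + log_ratio


def preference_loss(link, beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    # logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    # ref_logp_w / ref_logp_l: the same quantities under the reference policy.
    margin = beta * (link(logp_w - ref_logp_w) - link(logp_l - ref_logp_l))
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Hypothetical numbers: the policy doubled the chosen response's probability.
    args = (0.1, math.log(0.5), math.log(0.25), math.log(0.25), math.log(0.25))
    print("DPO loss:  ", preference_loss(link_dpo, *args))
    print("chiPO loss:", preference_loss(link_chipo, *args))
```

Because the extra $z$ term grows linearly in the density ratio rather than logarithmically, the same movement away from the reference policy changes the χPO margin much more than the DPO margin.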

Paper Structure

This paper contains 84 sections, 27 theorems, 225 equations, 4 figures, 2 tables, 3 algorithms.

Key Result

theorem 1

Suppose assumptions \ref{ass:realizability} and \ref{ass:vmax} hold for some $\beta>0$. With probability at least $1-\delta$, $\chi$PO (\ref{alg:main}) produces a policy $\widehat{\pi}$ such that for all policies $\pi^{\star}$ simultaneously, we have [...]. In particular, given any comparator policy $\pi^{\star}$, we can choose the regularization parameter $\beta$ to achieve [...].
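The abstract describes this guarantee as scaling with single-policy concentrability and credits the $\chi^2$-divergence with quantifying uncertainty more effectively than KL. The sketch below makes that contrast concrete on a toy discrete action space. The definitions used here are assumptions not stated in this summary: $C^{\pi}=\mathbb{E}_{a\sim\pi}[\pi(a)/\pi_{\mathsf{ref}}(a)]$ and $D_{\chi^2}(\pi\,\|\,\pi_{\mathsf{ref}})=\tfrac{1}{2}\mathbb{E}_{a\sim\pi_{\mathsf{ref}}}[(\pi(a)/\pi_{\mathsf{ref}}(a)-1)^2]$, under which $C^{\pi}=1+2D_{\chi^2}(\pi\,\|\,\pi_{\mathsf{ref}})$. The function names and the example distributions are hypothetical.

```python
import math


def concentrability(pi, pi_ref):
    # C^pi = E_{a ~ pi}[pi(a) / pi_ref(a)] = sum_a pi(a)^2 / pi_ref(a).
    return sum(p * p / q for p, q in zip(pi, pi_ref) if p > 0)


def chi2_divergence(pi, pi_ref):
    # Assumed convention: D_chi2 = 0.5 * E_{a ~ pi_ref}[(pi(a)/pi_ref(a) - 1)^2].
    return 0.5 * sum(q * (p / q - 1.0) ** 2 for p, q in zip(pi, pi_ref))


def kl_divergence(pi, pi_ref):
    # D_KL(pi || pi_ref) = E_{a ~ pi}[log(pi(a) / pi_ref(a))].
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)


if __name__ == "__main__":
    # Hypothetical toy case: pi puts all mass on an action that is rare under pi_ref.
    pi_ref = [0.98, 0.01, 0.01]
    pi = [0.0, 0.0, 1.0]
    print("C^pi  :", concentrability(pi, pi_ref))
    print("chi^2 :", chi2_divergence(pi, pi_ref))
    print("KL    :", kl_divergence(pi, pi_ref))
```

In this toy case the KL divergence grows only logarithmically in the density ratio while the $\chi^2$-divergence grows linearly, which is one way to see why a KL penalty alone may allow large shifts onto poorly covered actions.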

Figures (4)

  • Figure 1: Behavior of the mixed $\chi^2$-regularization link function $\phi_{\chi\mathsf{PO}}(z)=z+\log z$ and its inverse $\phi_{\chi\mathsf{PO}}^{-1}(z)=W_0(\exp(z))$, compared to the KL-regularization link function $\phi_{\mathsf{DPO}}(z)=\log z$ and its inverse $\phi_{\mathsf{DPO}}^{-1}(z)=\exp(z)$. Since $\phi_{\chi\mathsf{PO}}^{-1}(z)\approx z$ for $z \geq 1$, the $\chi$PO link leads to favorable heavy-tailed, pessimistic behavior.
  • Figure 2: Action probabilities for policies learned by $\chi$PO and DPO on the example from \ref{sec:illustrative}, under the "bad" event $\mathcal{E}$ in which the true reward model is $r^\star=r_1$ but the estimated reward model is $\widehat{r} = r_2$ ($n = 10$). Here, $r^\star(a_{\mathsf{good}}) = 1$ and $r^\star(a_{\mathsf{bad}}) = 0$, but $\widehat{r}(a_{\mathsf{good}}) = 0$ and $\widehat{r}(a_{\mathsf{bad}}) = 1$; both reward functions have $r^\star(a_0)=\widehat{r}(a_0)=1/2$, and the goal is to compete with a comparator policy that deterministically plays $a_0$. Overoptimization: the DPO policy is greedier with respect to the incorrect reward model and places much larger mass on the bad action $a_{\mathsf{bad}}$ for all $\beta \in (0, \frac{1}{2\log n}]$ (Right). As a result, the DPO policy places much smaller mass on the baseline action $a_0$, suffering significantly more overoptimization error than $\chi$PO (Left; see also \ref{fig:regret}). Bias: compared to DPO, $\chi$PO has a higher probability of taking both the optimal action $a_{\mathsf{good}}$ and the reference action $a_0$. As a result, it strikes a better bias-overoptimization tradeoff than DPO, and is competitive with respect to the comparator $a_0$ even when DPO fails to converge.
  • Figure 3: The regret $J(a_0)-J(\widehat{\pi})$ in the construction from \ref{prop:rpo_lower} for different values of $n$. We again condition on the "bad" event $\mathcal{E}$ where $\widehat{r} = r_2 \neq r^\star$. For each $n$, the error from overoptimization dominates when $\beta \le (2\log n)^{-1}$ (as discussed in \ref{sec:illustrative}), and the error from bias dominates when $\beta > (2\log n)^{-1}$. Taking the best choice of $\beta$ for each method, DPO converges at an exponentially slower rate than $\chi$PO.
  • Figure 4: (Left) TL;DR Summarization win rate, recorded every 250 steps over 2 epochs of training. The shaded area displays $\pm 1$ standard error over 3 seeds. At 1 epoch, $\chi$PO already obtains better performance, and it continues to improve over the course of training, while DPO degrades over time. (Right) KL divergence $D_{\mathsf{KL}}(\widehat{\pi}\,\|\,\pi_{\mathsf{ref}})$ averaged over 2 of the seeds. For the same $\beta$, $\chi$PO constrains the learned policy to be significantly closer to $\pi_{\mathsf{ref}}$, thereby striking a better bias-variance tradeoff.
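The inverse link functions quoted in the Figure 1 caption can be evaluated numerically to see the difference in growth. The sketch below solves $w+\log w=z$ with Newton's method in the variable $u=\log w$, which is equivalent to computing $W_0(\exp(z))$ but avoids overflowing $\exp(z)$ for large $z$. The function names, starting point, and iteration count are illustrative assumptions, not taken from the paper.

```python
import math


def inv_link_dpo(z):
    # Inverse of phi_DPO(z) = log z.
    return math.exp(z)


def inv_link_chipo(z, iters=60):
    # Inverse of phi_chiPO(w) = w + log w, i.e. W_0(exp(z)).
    # Solve h(u) = exp(u) + u - z = 0 for u = log w by Newton's method.
    # h is convex and increasing, and the start below satisfies h(u) >= 0,
    # so the iterates decrease monotonically to the root.
    u = z if z <= 0 else math.log(max(z, 1.0))
    for _ in range(iters):
        eu = math.exp(u)
        u -= (eu + u - z) / (eu + 1.0)
    return math.exp(u)


if __name__ == "__main__":
    for z in (-3.0, 0.5, 1.0, 5.0, 50.0):
        print(z, inv_link_chipo(z), inv_link_dpo(z))
```

For large arguments the χPO inverse stays below $z$ while the DPO inverse grows exponentially, which matches the caption's description of the χPO link as heavy-tailed and pessimistic.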

Theorems & Definitions (49)

  • theorem 1: Sample complexity bound for $\chi$PO
  • corollary 1: Sample complexity bound for $\chi$PO with a reward model
  • proposition 1
  • proposition 2
  • remark 1: DPO decreases probabilities of preferred and rejected responses
  • lemma 1: Informal version of \ref{lem:clip-dpo-estimation}
  • lemma 2: Informal version of \ref{lem:general-reward-to-policy}
  • theorem 2: Impossibility of single-policy concentrability under general preferences
  • theorem 3
  • proposition 3
  • ...and 39 more