Failure Modes of Maximum Entropy RLHF

Ömer Veysel Çağatan, Barış Akgün

TL;DR

The paper examines the limits of reference-free alignment by showing that Simple Preference Optimization (SimPO) admits a principled Maximum Entropy (MaxEnt) RL interpretation. It then empirically compares online Maximum Entropy RLHF against KL-regularized RLHF across multiple Pythia scales on the TL;DR benchmark. Maximum Entropy RLHF shows pronounced instability and overoptimization that appears to correlate with the entropy bonus, even at conservative learning rates. In contrast, SimPO performs strongly offline, where implicit data constraints and target margins aid stability, which highlights a fundamental discrepancy between offline and online settings. The findings suggest that robust reference-free alignment in online contexts needs regularization mechanisms beyond entropy, and that SimPO's offline success is attributable to stabilizing factors absent in online training.

Abstract

In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online or offline preference learning.

Paper Structure

This paper contains 35 sections, 4 theorems, 41 equations, 12 figures.

Key Result

Lemma 1

Under the Plackett-Luce preference framework, and in particular the Bradley-Terry framework, two reward functions from the same equivalence class induce the same preference distribution.
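The invariance in Lemma 1 can be checked numerically. The sketch below assumes the usual definition of the equivalence class, which this summary does not spell out: two reward functions are equivalent when they differ only by a prompt-dependent term, $r'(x, y) = r(x, y) + f(x)$. The reward values, the shift, and the function names are hypothetical and chosen only for illustration.

```python
import math

def plackett_luce_prob(rewards):
    """Probability of the ranking given by list order (best first) under Plackett-Luce."""
    prob = 1.0
    for k in range(len(rewards)):
        # Softmax of item k over the items not yet ranked, with max-subtraction for stability.
        tail = rewards[k:]
        m = max(tail)
        denom = sum(math.exp(r - m) for r in tail)
        prob *= math.exp(rewards[k] - m) / denom
    return prob

def bradley_terry_prob(r_w, r_l):
    """Probability that the response with reward r_w is preferred over the one with r_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

# Hypothetical rewards r(x, y_i) for three responses to the same prompt x.
rewards = [1.3, -0.4, 0.7]
# Hypothetical prompt-only shift f(x); assumed to define the equivalence class.
shift = 5.0
shifted = [r + shift for r in rewards]

pl_original = plackett_luce_prob(rewards)
pl_shifted = plackett_luce_prob(shifted)
bt_original = bradley_terry_prob(rewards[0], rewards[1])
bt_shifted = bradley_terry_prob(shifted[0], shifted[1])

print(math.isclose(pl_original, pl_shifted))  # shift cancels in every softmax term
print(math.isclose(bt_original, bt_shifted))  # shift cancels in the reward difference
```

Both models depend on the rewards only through differences between responses to the same prompt, so any term that depends on the prompt alone cancels. Bradley-Terry is the two-response case of Plackett-Luce, which is why the lemma covers it "in particular".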

Figures (12)

  • Figure 1: RLHF reward and entropy bonus during training for Pythia 1B with different entropy coefficients at learning rates 1e-6 (left) and 1e-7 (right). Win rates are reported in the legend for each entropy bonus coefficient setting.
  • Figure 2: RLHF reward and entropy bonus during training for Pythia 2.8B with different entropy bonus coefficients at learning rates 1e-7 (left) and 5e-8 (right). Win rates are reported in the legend for each entropy bonus coefficient setting.
  • Figure 3: KL divergence metrics for KL-constrained and Maximum Entropy RL across training. Top row shows KL divergence between the current policy and the reference policy ($KL(\pi_t \Vert \pi_{\text{ref}})$) for Pythia 1B (left) and 2.8B (right). Bottom row shows KL divergence between consecutive policy iterations ($KL(\pi_t \Vert \pi_{t-1})$).
  • Figure 4: Batch average of $\log\!\left( \frac{\pi_{\text{ref}}(y_w|x)}{\pi_{\text{ref}}(y_l|x)} \right)$ during DPO training on Pythia 1B.
  • Figure 5: KL divergence evolution during training for 1B and 2.8B parameter models using different regularization methods. The left panel shows results for the 1B model and the right panel shows results for the 2.8B model. Each panel compares KL-Constrained and Maximum-Entropy approaches. Checkmarks ($\checkmark$) indicate high win rate runs and crosses ($\times$) indicate overoptimized runs.
  • ...and 7 more figures

Theorems & Definitions (8)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1: Maximum Entropy Version
  • proof
  • Proposition 1
  • proof