Failure Modes of Maximum Entropy RLHF
Ömer Veysel Çağatan, Barış Akgün
TL;DR
The paper examines the limits of reference-free alignment by showing that Simple Preference Optimization (SimPO) admits a principled Maximum Entropy (MaxEnt) RL interpretation. It then empirically compares online Maximum Entropy RLHF against KL-regularized RLHF across multiple Pythia model scales on the TL;DR summarization benchmark. The online MaxEnt runs show pronounced instability and overoptimization that appears to be associated with entropy regularization, even at conservative learning rates. In contrast, SimPO performs strongly offline, where implicit data constraints and target margins help keep training stable. This discrepancy between the offline and online settings suggests that robust reference-free alignment in online training needs regularization mechanisms beyond entropy.
Abstract
In this paper, we show that Simple Preference Optimization (SimPO) can be derived as an instance of Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods, which maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches face different challenges in online and offline preference learning.
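For readers unfamiliar with SimPO, the following is a minimal sketch of its loss for a single preference pair, in the form the objective is commonly stated: a Bradley-Terry loss over length-normalized log-likelihoods with a target margin and no reference-model term. It is an illustration only, not the implementation used in the paper. The function name `simpo_loss` and the default values of `beta` and `gamma` are assumptions chosen for the example.

```python
import math


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """Illustrative SimPO loss for one (chosen, rejected) response pair.

    Inputs are per-token log-probabilities under the policy being trained.
    beta and gamma defaults are hypothetical, not values from the paper.
    """
    # Length-normalized log-likelihood serves as the implicit reward;
    # note that no reference policy appears anywhere in the computation.
    chosen_reward = beta * sum(chosen_token_logps) / len(chosen_token_logps)
    rejected_reward = beta * sum(rejected_token_logps) / len(rejected_token_logps)

    # Bradley-Terry margin, shifted by the target margin gamma.
    margin = chosen_reward - rejected_reward - gamma

    # -log(sigmoid(margin)) = softplus(-margin), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the policy assigns a higher average per-token log-probability to the chosen response than to the rejected one by more than `gamma / beta`. The absence of a reference-policy term is what makes the method reference-free.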
