RePO: Understanding Preference Learning Through ReLU-Based Optimization
Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang
TL;DR
The paper introduces RePO, a ReLU-based offline preference optimization method that removes the need for the $\beta$ parameter by enforcing a margin threshold $\gamma$, and shows that RePO is the $\beta \to \infty$ limit of SimPO, equivalent to the convex envelope of the $0$-$1$ loss. Theoretical results justify the approach by linking RePO to an optimal convex surrogate for binary preference, and empirical results across multiple models and benchmarks demonstrate competitive performance with a single tunable hyperparameter and additional gains from RePO++ and dynamic margin scheduling. The work also explores the relationship to existing methods (DPO, SLiC-HF, SimPO) and reveals practical insights into data filtering and curriculum learning that can improve offline preference learning. Overall, RePO provides a simple, principled, and effective alternative to more complex off-policy preference optimization frameworks with implications for scalable model alignment.
Abstract
Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models, requiring only one hyperparameter to tune.
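To make the "binary thresholding" claim concrete, here is a minimal per-pair sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the margin is assumed to be the length-normalized log-likelihood difference between the chosen and rejected responses (SimPO's reference-free margin), the SimPO loss is assumed to take the margin-centered form $-\log \sigma(\beta (M - \gamma))$, and all function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def margin(logp_chosen: float, len_chosen: int,
           logp_rejected: float, len_rejected: int) -> float:
    """Assumed reference-free margin M: difference of length-normalized
    sequence log-probabilities (chosen minus rejected)."""
    return logp_chosen / len_chosen - logp_rejected / len_rejected


def repo_loss(m: float, gamma: float) -> float:
    """ReLU-based max-margin loss: ReLU(gamma - M).
    Pairs whose margin already reaches gamma contribute zero loss."""
    return max(0.0, gamma - m)


def repo_grad_weight(m: float, gamma: float) -> float:
    """Magnitude of d(loss)/dM for the ReLU loss: a hard 0/1 gate."""
    return 1.0 if m < gamma else 0.0


def simpo_grad_weight(m: float, gamma: float, beta: float) -> float:
    """Logistic weight sigma(-beta * (M - gamma)) that scales the gradient
    of the assumed margin-centered SimPO loss (up to a factor of beta)."""
    return sigmoid(-beta * (m - gamma))


if __name__ == "__main__":
    gamma = 1.0
    for m in (0.8, 1.5):  # one pair below the threshold, one above
        weights = [round(simpo_grad_weight(m, gamma, b), 4) for b in (1, 10, 1000)]
        print(f"M={m}: RePO loss={repo_loss(m, gamma):.2f}, "
              f"RePO gate={repo_grad_weight(m, gamma)}, "
              f"SimPO weights (beta=1,10,1000)={weights}")
```

As `beta` grows, the logistic weight approaches the 0/1 gate of the ReLU loss: pairs with margin below `gamma` keep full weight, while pairs that already satisfy the margin are filtered out, leaving `gamma` as the only hyperparameter in this sketch.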
