RePO: Understanding Preference Learning Through ReLU-Based Optimization

Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

TL;DR

The paper introduces RePO, a ReLU-based offline preference optimization method that removes the need for the $\beta$ parameter by enforcing a margin threshold $\gamma$, and shows that RePO is the $\beta \to \infty$ limit of SimPO, equivalent to the convex envelope of the $0$-$1$ loss. Theoretical results justify the approach by linking RePO to an optimal convex surrogate for binary preference, and empirical results across multiple models and benchmarks demonstrate competitive performance with a single tunable hyperparameter and additional gains from RePO++ and dynamic margin scheduling. The work also explores the relationship to existing methods (DPO, SLiC-HF, SimPO) and reveals practical insights into data filtering and curriculum learning that can improve offline preference learning. Overall, RePO provides a simple, principled, and effective alternative to more complex off-policy preference optimization frameworks with implications for scalable model alignment.

Abstract

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to tune.
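
The ReLU-based max-margin loss described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the reference-free margin $M_\theta$ is the difference of length-normalized log-likelihoods (as in SimPO), that the RePO loss is $\mathrm{ReLU}(\gamma - M_\theta)$, and that SimPO is parameterized as $-\log\sigma(\beta(M_\theta - \gamma))$. The function names and the example numbers are hypothetical.

```python
import math


def length_normalized_margin(logp_chosen, len_chosen, logp_rejected, len_rejected):
    """Assumed form of M_theta: per-token average log-likelihood of the
    chosen response minus that of the rejected response."""
    return logp_chosen / len_chosen - logp_rejected / len_rejected


def repo_loss(margin, gamma):
    """ReLU max-margin loss (assumed form): zero once the margin reaches gamma."""
    return max(0.0, gamma - margin)


def simpo_loss(margin, beta, gamma):
    """Logistic loss -log(sigmoid(beta * (margin - gamma))) under the assumed
    parameterization, computed as a numerically stable softplus."""
    z = beta * (margin - gamma)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Two hypothetical preference pairs: (summed log-prob, token count) per response.
margins = [
    length_normalized_margin(-12.0, 10, -30.0, 12),  # well-separated pair
    length_normalized_margin(-20.0, 10, -22.0, 10),  # barely separated pair
]
gamma = 0.4
losses = [repo_loss(m, gamma) for m in margins]

for m, loss in zip(margins, losses):
    print(f"margin={m:.2f}  RePO loss={loss:.2f}  SimPO loss (beta=2)={simpo_loss(m, 2.0, gamma):.3f}")
```

In this sketch the well-separated pair contributes exactly zero RePO loss (it is "filtered"), while the logistic loss still assigns it a small positive value, which is the behavior the abstract contrasts.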

Paper Structure

This paper contains 33 sections, 8 theorems, 39 equations, 9 figures, 7 tables.

Key Result

Lemma 3.0

Under the same $M_\theta$ and $\gamma$ definitions, the SimPO gradient converges pointwise to the RePO gradient as $\beta \to \infty$:
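
One way to write the weighting part of this limit, as a sketch under a stated assumption: suppose SimPO is parameterized as $-\log \sigma(\beta (M_\theta - \gamma))$, so that its per-sample gradient weight is $s_\theta = \sigma(-\beta (M_\theta - \gamma))$. This parameterization is an assumption chosen to agree with the caption of Figure 2; the lemma's own statement may be normalized differently.

$$
\lim_{\beta \to \infty} s_\theta
= \lim_{\beta \to \infty} \sigma\big(-\beta\,(M_\theta - \gamma)\big)
= \begin{cases}
1, & M_\theta < \gamma, \\
\tfrac{1}{2}, & M_\theta = \gamma, \\
0, & M_\theta > \gamma.
\end{cases}
$$

Under this assumption the limit coincides with the RePO indicator $\mathbb{I}(M_\theta < \gamma)$ everywhere except on the boundary $M_\theta = \gamma$.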

Figures (9)

  • Figure 1: Comparing preference learning mechanisms. RePO employs a simpler binary thresholding mechanism than SimPO and DPO, as highlighted in the shaded box. Despite its simplicity, this mechanism achieves competitive results by naturally preventing over-optimization.
  • Figure 2: Gradient weighting functions of SimPO ($s_\theta$) and RePO ($\mathbb{I}(M_\theta < \gamma)$). As $\beta \to \infty$, $s_\theta$ converges to the binary indicator (red line), establishing RePO as the limit case of SimPO.
  • Figure 3: Performance of SimPO with varying $\beta$ and RePO on AlpacaEval2 benchmark.
  • Figure 4: Implicit reward margin $M_\theta$ distribution across training steps (total: 467) for RePO at $\gamma = 0.4$. Dashed line: $\gamma = 0.4$. Green: samples below $\gamma$ (gradient descent); gray: samples above $\gamma$ (zero gradient). Numbers: fraction of samples above $\gamma$.
  • Figure 5: Line plot of RePO performance (AlpacaEval 2 LC Win Rate) and bar chart of mean reward margins ($m_\mathcal{D}$) across varying $\gamma$ values. See the appendix on varying $\gamma$ for details.
  • ...and 4 more figures
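
The gradient weighting in Figure 2 and the filtered fraction in Figure 4 can be illustrated numerically. The sketch below assumes the SimPO weight has the form $s_\theta = \sigma(-\beta(M_\theta - \gamma))$, which is consistent with the Figure 2 caption but not quoted from the paper; the helper names and the sample margins are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simpo_weight(margin, beta, gamma):
    """Assumed SimPO gradient weight: sigma(-beta * (M_theta - gamma))."""
    return sigmoid(-beta * (margin - gamma))


def repo_weight(margin, gamma):
    """RePO gradient weight: the binary indicator I(M_theta < gamma)."""
    return 1.0 if margin < gamma else 0.0


def fraction_filtered(margins, gamma):
    """Share of samples at or above gamma, i.e. receiving zero gradient."""
    return sum(1 for m in margins if m >= gamma) / len(margins)


gamma = 0.4
margins = [-0.5, 0.1, 0.39, 0.41, 1.0]  # hypothetical margins, none exactly at gamma

# Largest gap between the logistic weight and the indicator, per beta.
gaps = {
    beta: max(abs(simpo_weight(m, beta, gamma) - repo_weight(m, gamma)) for m in margins)
    for beta in (1.0, 10.0, 1000.0)
}

for beta, gap in gaps.items():
    print(f"beta={beta:g}: max |s_theta - indicator| = {gap:.2e}")
print("fraction with zero gradient:", fraction_filtered(margins, gamma))
```

The gap shrinks as $\beta$ grows for margins away from $\gamma$, while samples close to the threshold converge most slowly; exactly at $M_\theta = \gamma$ the logistic weight stays at $1/2$ for every $\beta$.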

Theorems & Definitions (15)

  • Lemma 3.0: Gradient Equivalence in the SimPO-to-RePO Limit
  • Proof (Sketch)
  • Remark 3.1
  • Definition 4.1
  • Theorem 4.2: ReLU as Convex Envelope
  • Corollary 4.2: Optimality Preservation
  • Corollary 4.2: Logistic Loss Suboptimality
  • Lemma C.0: Gradient Equivalence in the SimPO-to-RePO Limit
  • Proof
  • Theorem C.1: ReLU as Convex Envelope
  • ...and 5 more