RePO: Understanding Preference Learning Through ReLU-Based Optimization

Junkang Wu; Kexin Huang; Xue Wang; Jinyang Gao; Bolin Ding; Jiancan Wu; Xiangnan He; Xiang Wang

TL;DR

The paper introduces RePO, a ReLU-based offline preference optimization method that removes the $\beta$ hyperparameter and keeps only a margin threshold $\gamma$. It shows that RePO is the $\beta \to \infty$ limit of SimPO and that its loss is the convex envelope of the 0-1 loss, which links RePO to an optimal convex surrogate for binary preference. Empirical results across multiple models and benchmarks show competitive performance with a single tunable hyperparameter, with additional gains from RePO++ and dynamic margin scheduling. The work also relates RePO to existing methods (DPO, SLiC-HF, SimPO) and offers practical insights into data filtering and curriculum learning for offline preference learning. Overall, RePO is a simple, principled alternative to more complex offline preference optimization frameworks, with implications for scalable model alignment.
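The limiting behaviour can be checked numerically. The sketch below is illustrative only and is not taken from the paper's code: it assumes the SimPO objective is written as $-\log \sigma(\beta (m - \gamma))$ for a scalar length-normalized margin $m$, and it rescales the loss by $1/\beta$ so that the limit stays finite. All function names are hypothetical.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def scaled_logistic_loss(margin, gamma, beta):
    # (1 / beta) * -log(sigmoid(beta * (margin - gamma))).
    # The 1 / beta rescaling is an assumption made for this illustration.
    return softplus(-beta * (margin - gamma)) / beta


def relu_loss(margin, gamma):
    # ReLU-based max-margin loss: zero once the margin reaches gamma.
    return max(0.0, gamma - margin)


def logistic_weight(margin, gamma, beta):
    # Per-pair gradient weight of the logistic loss: sigmoid(-beta * (margin - gamma)).
    z = -beta * (margin - gamma)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    gamma = 0.5  # hypothetical margin threshold
    for beta in (1.0, 10.0, 1000.0):
        for margin in (0.2, 0.9):  # one pair below the threshold, one above
            print(
                beta,
                margin,
                scaled_logistic_loss(margin, gamma, beta),
                relu_loss(margin, gamma),
                logistic_weight(margin, gamma, beta),
            )
```

As $\beta$ grows, the rescaled logistic loss approaches $\max(0, \gamma - m)$, and the gradient weight approaches 1 for the pair below the threshold and 0 for the pair above it, which is the binary thresholding described above.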

Abstract

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to tune.
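A minimal sketch of the ReLU-based max-margin loss described in the abstract. It assumes the margin is the difference of length-normalized sequence log-probabilities (SimPO's reference-free margin) and that the loss is $\max(0, \gamma - m)$ averaged over the batch. The exact form, the function names, and the toy numbers are assumptions for illustration, not the authors' implementation.

```python
def length_normalized_margin(logp_chosen, len_chosen, logp_rejected, len_rejected):
    # Reference-free margin: average per-token log-probability of the chosen
    # response minus that of the rejected response.
    return logp_chosen / len_chosen - logp_rejected / len_rejected


def repo_loss(margins, gamma):
    # Mean of max(0, gamma - margin) over the batch, plus a mask marking the
    # pairs that still contribute a gradient (margin below gamma).
    if not margins:
        raise ValueError("margins must be non-empty")
    losses = [max(0.0, gamma - m) for m in margins]
    active = [m < gamma for m in margins]
    return sum(losses) / len(losses), active


if __name__ == "__main__":
    # Toy batch of two preference pairs; the values are made up.
    batch = [
        (-12.0, 10, -30.0, 10),  # already well separated
        (-20.0, 10, -22.0, 10),  # barely separated
    ]
    margins = [length_normalized_margin(*pair) for pair in batch]
    loss, active = repo_loss(margins, gamma=1.0)
    print(margins, loss, active)
```

Pairs whose margin already exceeds $\gamma$ contribute zero loss and zero gradient, which is one way to read the claim that the ReLU loss filters trivial pairs; $\gamma$ is the only hyperparameter in this sketch.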
