Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Hyung Gyu Rho

TL;DR

MADPO tackles the brittleness of Direct Preference Optimization caused by a fixed temperature by introducing an instance-level, margin-adaptive reweighting scheme. It learns per-sample margins with a reward model and then applies a continuous weight to the DPO loss, amplifying informative low-margin pairs while dampening easy high-margin ones. Theoretical results establish a stable optimization landscape and robustness to reward-estimation errors, while experiments on synthetic IMDB sentiment data show MADPO outperforms DPO, IPO, and $\beta$-DPO across data qualities, with notable gains on high-quality data. Collectively, MADPO offers a principled, robust method for fine-grained preference alignment with practical gains and stability benefits.

Abstract

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter this. While IPO addresses general overfitting, its uniform regularization can be overly conservative. The more targeted approach of $\beta$-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative $\beta$ values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a sentiment generation task, where MADPO consistently and significantly outperforms strong baselines across datasets of varying quality. It achieves performance gains of up to +33.3% on High Quality data and +10.5% on Low Quality data over the next-best method. Our results establish MADPO as a more robust and principled approach to preference alignment.
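The two-step idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's exact formulation: the weight function `margin_weight` below is a hypothetical, log-linear interpolation chosen only because it is continuous, equals $c$ at zero margin, equals 1 at the threshold $\tau$, and floors at $1/c$ for large margins. The sketch also assumes the weight simply multiplies the per-sample DPO loss, and that the estimated margin `h_hat` comes from an already trained reward model. The defaults $\tau = 7$ and $c = 2$ mirror one of the settings mentioned in the Figure 3 caption; all function and parameter names are illustrative.

```python
import math


def margin_weight(h_hat, tau=7.0, c=2.0):
    """Hypothetical continuous weight (an assumption, not the paper's formula).

    Returns c when the estimated margin is 0, 1 when |h_hat| == tau,
    and 1/c once |h_hat| >= 2 * tau.
    """
    t = (tau - abs(h_hat)) / tau       # +1 at zero margin, 0 at the threshold
    t = max(-1.0, min(1.0, t))         # clamp so the weight stays in [1/c, c]
    return c ** t


def dpo_loss(policy_logratio, ref_logratio, beta=0.1):
    """Standard per-sample DPO loss: -log sigmoid(beta * (policy - ref)).

    Each log-ratio is log p(y_w | x) - log p(y_l | x) under the given model.
    """
    z = beta * (policy_logratio - ref_logratio)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_dpo_loss(policy_logratio, ref_logratio, h_hat,
                      beta=0.1, tau=7.0, c=2.0):
    """Per-sample DPO loss scaled by the margin-dependent weight (assumed form)."""
    return margin_weight(h_hat, tau, c) * dpo_loss(policy_logratio, ref_logratio, beta)


if __name__ == "__main__":
    for h in (0.0, 3.5, 7.0, 14.0):
        print(h, margin_weight(h), weighted_dpo_loss(1.0, 0.5, h))
```

Under these assumptions, a hard pair with an estimated margin near zero contributes up to $c$ times the usual DPO loss, while an easy pair far above $\tau$ contributes as little as $1/c$ times, and no pair is discarded.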

Paper Structure

This paper contains 39 sections, 6 theorems, 40 equations, 3 figures.

Key Result

Proposition 4.1

Under the BTL model with an optimal reward model $r_{\phi^*}$, the optimal policy parameter $\theta^*$ that minimizes the MADPO loss $\mathcal{L}(\theta, \phi^*;x,y_w,y_l)$ satisfies the following for any preference pair $(x, y_w, y_l) \in \mathcal{D}_{\text{low}}$: where the low-margin subset $\mathcal{D}_{\text{low}}$ is defined as:

Figures (3)

  • Figure 1: Main experimental results. (a) Table of mean rewards (standard error) for all methods across three data quality tiers. (b) Bar chart visualizing the mean rewards, clearly showing MADPO's superior performance and robustness compared to baselines. For $\beta$-DPO and MADPO, we report the performance of the best hyperparameter configuration found for each individual tier.
  • Figure 2: Sensitivity analysis for MADPO's key hyperparameters across the three data quality tiers. (Left) Performance as a function of the margin threshold, $\tau$. Higher values are generally better, though performance plateaus on High Quality data. (Right) Performance as a function of the amplification intensity, $c$, where we set $c_{\max} = c$ and $c_{\min} = 1/c$. Performance consistently improves with higher intensity across all tiers.
  • Figure 3: Ablation study of MADPO's amplification and regularization components. The study compares the full MADPO model against vanilla DPO and two ablated versions: Amp-Only, which only amplifies low-margin pairs (by setting $w(h_{\hat{\phi}})=1$ for $|h_{\hat{\phi}}| \ge \tau$), and Reg-Only, which only regularizes high-margin pairs (by setting $w(h_{\hat{\phi}})=1$ for $|h_{\hat{\phi}}| < \tau$). The comparison is shown under two hyperparameter settings: (Left) a moderate amplification intensity ($c=2, \tau=7$), and (Right) a high amplification intensity ($c=4, \tau=7$).
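The two ablations named in the Figure 3 caption can be expressed as wrappers around any weight function. The sketch below is illustrative only: `ablate` and `toy_weight` are hypothetical names, and `toy_weight` is a simple step function standing in for the paper's continuous weight, which is not reproduced here. Only the masking rules (weight set to 1 for $|h_{\hat{\phi}}| \ge \tau$ in Amp-Only, and for $|h_{\hat{\phi}}| < \tau$ in Reg-Only) are taken from the caption.

```python
def ablate(weight_fn, mode, tau):
    """Wrap a weight function to mimic the ablations in the Figure 3 caption.

    mode == "amp_only": keep the weight for low-margin pairs, use 1 otherwise.
    mode == "reg_only": keep the weight for high-margin pairs, use 1 otherwise.
    mode == "full":     return the weight unchanged.
    """
    if mode not in ("full", "amp_only", "reg_only"):
        raise ValueError(f"unknown mode: {mode}")

    def wrapped(h_hat):
        low_margin = abs(h_hat) < tau
        if mode == "amp_only" and not low_margin:
            return 1.0
        if mode == "reg_only" and low_margin:
            return 1.0
        return weight_fn(h_hat)

    return wrapped


def toy_weight(h_hat, tau=7.0, c=2.0):
    """Toy stand-in (an assumption): c below the threshold, 1/c at or above it."""
    return c if abs(h_hat) < tau else 1.0 / c


if __name__ == "__main__":
    amp_only = ablate(toy_weight, "amp_only", tau=7.0)
    reg_only = ablate(toy_weight, "reg_only", tau=7.0)
    for h in (1.0, 10.0):
        print(h, toy_weight(h), amp_only(h), reg_only(h))
```

Writing the ablations as wrappers keeps the base weight untouched, so the same function can be compared across the full, Amp-Only, and Reg-Only variants.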

Theorems & Definitions (12)

  • Proposition 4.1
  • Proposition 4.2
  • Theorem 4.6
  • Proposition 4.7
  • Proof of Proposition 4.1
  • Proof of Proposition 4.2
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 2 more