Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Hyung Gyu Rho

TL;DR

MADPO tackles the brittleness of Direct Preference Optimization caused by a fixed temperature by introducing an instance-level, margin-adaptive reweighting scheme. It learns per-sample margins with a reward model and then applies a continuous weight to the DPO loss, amplifying informative low-margin pairs while dampening easy high-margin ones. Theoretical results establish a stable optimization landscape and robustness to reward-estimation errors, while experiments on synthetic IMDB sentiment data show MADPO outperforms DPO, IPO, and $\beta$-DPO across data qualities, with notable gains on high-quality data. Collectively, MADPO offers a principled, robust method for fine-grained preference alignment with practical gains and stability benefits.

Abstract

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods attempt to counter this. While IPO addresses general overfitting, its uniform regularization can be overly conservative. The more targeted approach of $\beta$-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative $\beta$ values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a sentiment generation task, where MADPO consistently and significantly outperforms strong baselines across datasets of varying quality. It achieves performance gains of up to +33.3\% on High Quality data and +10.5\% on Low Quality data over the next-best method. Our results establish MADPO as a more robust and principled approach to preference alignment.
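
The two-step idea in the abstract (estimate a per-pair reward margin, then reweight the DPO loss per sample) can be illustrated with a minimal sketch. The abstract does not give the weighting function, so `margin_weight`, its threshold `tau`, and its `scale` are hypothetical choices made only for illustration, as is the decision to multiply the per-pair loss by the weight. The sketch assumes the reward margin has already been produced by a separately trained reward model, and that each log-ratio is the log-probability of the chosen response minus that of the rejected one.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_loss(policy_logratio: float, ref_logratio: float, beta: float = 0.1) -> float:
    """Standard per-pair DPO loss: -log sigmoid(beta * (policy - reference))."""
    z = beta * (policy_logratio - ref_logratio)
    return softplus(-z)  # identical to -log(sigmoid(z))


def margin_weight(margin: float, tau: float = 1.0, scale: float = 1.0) -> float:
    """ASSUMPTION: a hypothetical continuous weight, not the paper's formula.

    Equals 1 at margin == tau, rises toward 2 for low-margin (hard) pairs,
    and decays toward 0 for high-margin (easy) pairs.
    """
    return 2.0 * sigmoid((tau - margin) / scale)


def weighted_dpo_loss(
    policy_logratio: float,
    ref_logratio: float,
    reward_margin: float,
    beta: float = 0.1,
) -> float:
    """Per-sample DPO loss scaled by the margin-dependent weight."""
    return margin_weight(reward_margin) * dpo_loss(policy_logratio, ref_logratio, beta)


if __name__ == "__main__":
    # Same policy/reference log-ratios, different reward-model margins.
    print("hard pair:", weighted_dpo_loss(0.5, 0.2, reward_margin=0.1))
    print("easy pair:", weighted_dpo_loss(0.5, 0.2, reward_margin=5.0))
```

Because the weight is computed per pair, a batch that mixes hard and easy pairs does not share a single compromised setting, and no pair is filtered out. This is the contrast with batch-level adaptation that the abstract draws.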
