Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

Adam Barla; Emanuele Nevali; Luca Viano; Volkan Cevher

TL;DR

This work tackles the over-optimization problem in Direct Preference Optimization (DPO) when the data-generating distribution is unknown. It introduces PEPO, a pessimistic ensemble approach that trains multiple DPO-like policies on disjoint data subsets and aggregates them via a worst-case criterion, using a Bradley-Terry model with ties to embed pessimism. In the tabular setting, PEPO achieves theoretical guarantees depending only on the single-policy concentrability $C^\star$, avoiding the all-policy term, and it characterizes the optimal ensemble size needed for pessimism. Empirically, PEPO improves post-training performance across a range of open-source and large-scale models and remains robust under distributional mismatch, with a token-level variant offering practical generation speed. The approach preserves the simplicity of DPO while delivering provable robustness to over-optimization in settings where $\pi_{\mathrm{data}}$ is inaccessible.

Abstract

We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm that mitigates the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism by training an ensemble of preference-optimized policies on disjoint data subsets and then aggregating them through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees that depend only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability that affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by strong practical performance, while PEPO retains the simplicity and practicality of DPO-style training.
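The two ingredients named in the abstract, disjoint data subsets and a worst-case aggregation that favors agreement, can be illustrated with a small tabular sketch. This is not the paper's output-policy rule, which is not reproduced on this page: the pointwise-minimum aggregation, the function names `split_disjoint` and `pessimistic_aggregate`, and the toy ensemble values are all assumptions made purely for illustration.

```python
import numpy as np


def split_disjoint(num_samples, num_models, seed=0):
    """Partition sample indices into num_models disjoint subsets."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_samples), num_models)


def pessimistic_aggregate(policies):
    """Assumed rule: pointwise minimum over ensemble members, renormalized.

    `policies` has shape (L, num_actions); each row is one member's
    action distribution for a fixed prompt. Entries must be positive.
    """
    policies = np.asarray(policies, dtype=float)
    worst_case = policies.min(axis=0)
    return worst_case / worst_case.sum()


# Each ensemble member would be trained on one of these index subsets.
subsets = split_disjoint(num_samples=12, num_models=3)

# Hypothetical per-member policies over 3 actions (not trained here).
ensemble = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.20, 0.70],
]
pi_out = pessimistic_aggregate(ensemble)
print(pi_out)
```

In this toy case the members disagree sharply on the first and last actions, so the minimum removes most of their mass, while the middle action, on which all members roughly agree, ends up with the largest probability.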

Paper Structure (57 sections, 20 theorems, 157 equations, 6 figures, 3 tables, 2 algorithms)

Introduction
Theoretical contribution
Practical contribution
Paper Organization
Preliminaries
Reinforcement Learning from Human Feedback: RLHF
Direct Preference Optimization
Limitations of RLHF and DPO (i.e., the lack of pessimism):
The algorithm
Bradley-Terry model with ties
Pessimistic interpretation of shifted sigmoid functions
Creating the models ensemble
Pessimism via ensemble
Efficient Sampling from $\boldsymbol{\pi_{\mathrm{out}}}$
Discussion of the hyperparameter
...and 42 more sections

Key Result

Lemma 3.2

The policy $\pi_{\mathrm{out}}$, defined as the solution of the optimization problem in eq:outputDPO, satisfies, for all $(x,a)\in \mathcal{X}\times\mathcal{A}$

Figures (6)

Figure 1: Win rate (%) against the initial model on AlpacaEval across training epochs. Shaded regions indicate the standard error of the mean across per-instruction preferences. We compare DPO and PEPO with varying ensemble sizes $L$. PEPO consistently outperforms DPO, with more pronounced improvements on models with stronger initial GPT-4 win rates (Yi-34B-Chat (27.5%) and Llama-3.1-Tulu-3-8B-SFT (8.6%) show larger gains compared to Mistral-7B (4.1%) and Zephyr-7B (3.5%)).
Figure 2: Experiment in the $3$-armed bandit setting with known and unknown $\pi_{\mathrm{data}}$.
Figure 3: Ablation for $L$ in a bandit setting.
Figure 4: Result in the controlled setting without regularization, i.e. with $\beta=0$.
Figure 5: Illustration of the tie probability (missing mass) that results from shifting the sigmoid function to the right. For our analysis, the important feature is that the shifted sigmoid (in red) lower bounds the standard sigmoid (in blue). In contrast, previous literature has often motivated this change by arguing that it induces a larger margin on the horizontal axis.
...and 1 more figure
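The property highlighted in the Figure 5 caption can be checked numerically. The sketch below assumes the tie model takes the form of a sigmoid shifted right by a non-negative amount, with the tie probability defined as the mass left over after the win and loss probabilities; the function name `bt_with_ties`, the parameter `shift`, and the chosen value `shift=1.0` are illustrative and not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bt_with_ties(margin, shift):
    """Win/tie/loss probabilities under an assumed shifted-sigmoid model.

    `margin` is the reward difference between the two responses and
    `shift` >= 0 is the rightward shift of the sigmoid.
    """
    win = sigmoid(margin - shift)
    loss = sigmoid(-margin - shift)
    tie = 1.0 - win - loss  # the "missing mass"
    return win, tie, loss


margins = np.linspace(-5.0, 5.0, 101)
win, tie, loss = bt_with_ties(margins, shift=1.0)

# The shifted sigmoid never exceeds the standard one.
print(bool(np.all(win <= sigmoid(margins))))
# The leftover mass is a valid (non-negative) tie probability.
print(bool(np.all(tie >= 0.0)))
```

With `shift=0` the tie mass vanishes and the standard Bradley-Terry probabilities are recovered, so the shift controls how much probability is withheld from both outcomes.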

Theorems & Definitions (35)

Definition 2.1
Definition 3.1
Lemma 3.2
Lemma 3.3
Theorem 4.1
Lemma 4.2
Lemma 4.3
Lemma 4.4
proof
Lemma E.1
...and 25 more
