Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution
Adam Barla, Emanuele Nevali, Luca Viano, Volkan Cevher
TL;DR
This work tackles the over-optimization problem in Direct Preference Optimization (DPO) when the data-generating distribution is unknown. It introduces PEPO, a pessimistic ensemble approach that trains multiple DPO-like policies on disjoint data subsets and aggregates them via a worst-case criterion, using a Bradley-Terry model with ties to embed pessimism. In the tabular setting, PEPO's theoretical guarantees depend only on the single-policy concentrability $C^\star$, avoiding the all-policy term, and the analysis characterizes the optimal ensemble size needed for pessimism. Empirically, PEPO improves post-training performance across a range of open-source and large-scale models and remains robust under distributional mismatch, while a token-level variant offers practical generation speed. The approach preserves the simplicity of DPO while delivering provable robustness to over-optimization in settings where $\pi_{\mathrm{data}}$ is inaccessible.
Abstract
We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm that mitigates the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism by training an ensemble of preference-optimized policies on disjoint data subsets and then aggregating them through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees that depend only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability that affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by strong practical performance, while PEPO retains the simplicity and practicality of DPO-style training.
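To make the ensemble idea concrete, here is a minimal tabular sketch of the two ingredients named above: splitting the data into disjoint subsets and aggregating the resulting policies pessimistically. The abstract does not specify the aggregation rule, so the rule used here (taking the minimum implicit reward $\log(\pi_k/\pi_{\mathrm{ref}})$ across ensemble members and reweighting a reference policy) is an assumption chosen for illustration, not necessarily the construction used by PEPO. The names `split_disjoint`, `pessimistic_policy`, `log_ratios`, and `ref_probs` are hypothetical.

```python
import numpy as np


def split_disjoint(n_examples, n_models, seed=0):
    """Partition example indices into n_models disjoint subsets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)
    return np.array_split(perm, n_models)


def pessimistic_policy(log_ratios, ref_probs):
    """Aggregate an ensemble with a worst-case rule (illustrative assumption).

    log_ratios: array of shape (K, A); entry [k, a] is the implicit reward
        log(pi_k(a) / pi_ref(a)) of ensemble member k for response a.
    ref_probs: array of shape (A,), the reference policy for one prompt.
    Returns a distribution proportional to ref_probs * exp(min_k log_ratios).
    """
    worst_case = log_ratios.min(axis=0)        # most pessimistic member per response
    logits = np.log(ref_probs) + worst_case
    logits = logits - logits.max()             # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Three candidate responses, two ensemble members, uniform reference policy.
ref_probs = np.ones(3) / 3
log_ratios = np.array([
    [2.0, 0.5, -1.0],    # member 1 strongly prefers response 0
    [-1.0, 0.4, -1.0],   # member 2 disagrees about response 0
])
policy = pessimistic_policy(log_ratios, ref_probs)
subsets = split_disjoint(n_examples=10, n_models=3)
```

In this toy example, response 0 is rated highly by only one member, so the minimum discounts it, and the aggregated policy places most of its mass on response 1, which both members rate favorably. This is the sense in which a worst-case rule favors agreement across models.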
