$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization
Jiaqi Han, Mingjian Jiang, Yuxuan Song, Stefano Ermon, Minkai Xu
TL;DR
This work introduces $f$-PO, a distribution-matching framework for language-model alignment that minimizes $\mathbb{D}_f(\hat{\pi}_\theta\|\hat{\pi}^*)$ between the parameterized policy and the optimal policy. By defining $\hat{\pi}_\theta(y|x) \propto \pi_\theta(y|x)^{\beta} \pi_{ref}(y|x)^{1-\beta}$ and $\hat{\pi}^*(y|x) \propto \pi_{ref}(y|x) \exp(r(x,y))$, the approach generalizes and unifies existing offline preference methods such as DPO (forward KL) and EXO (reverse KL), and enables new divergences such as the $\alpha$-divergence. The authors provide theoretical results linking the $f$-PO objective to the $f$-divergence between these distributions, derive practical objectives for pairwise preference data, and run extensive experiments showing that $\alpha$-PO often yields state-of-the-art or competitive performance on AlpacaEval 2, Arena-Hard, MT-Bench, and Open LLM Leaderboard v2, with ablations highlighting the effect of the divergence choice. They also propose empirical refinements that improve training efficiency by reducing reliance on the reference model. Overall, $f$-PO offers a principled, flexible, and empirically effective framework for offline preference optimization and LM alignment.
Abstract
Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that generalizes and extends existing approaches. $f$-PO minimizes $f$-divergences between the optimized policy and the optimal policy, encompassing a broad family of alignment methods using various divergences. Our approach unifies previous algorithms like DPO and EXO, while offering new variants through different choices of $f$-divergences. We provide theoretical analysis of $f$-PO's properties and conduct extensive experiments on state-of-the-art language models using benchmark datasets. Results demonstrate $f$-PO's effectiveness across various tasks, achieving superior performance compared to existing methods on popular benchmarks such as AlpacaEval 2, Arena-Hard, MT-Bench, and Open LLM Leaderboard v2. Additionally, we present ablation studies exploring the impact of different $f$-divergences, offering insights into the trade-offs between regularization and performance in offline preference optimization. Our work contributes both practical algorithms and theoretical understanding to the field of language model alignment. Code is available at https://github.com/MinkaiXu/fPO.
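To make the distribution-matching view concrete, the following is a minimal numerical sketch, not the authors' implementation. It builds $\hat{\pi}_\theta$ and $\hat{\pi}^*$ from the definitions in the TL;DR on a toy set of four candidate responses and evaluates $\mathbb{D}_f(\hat{\pi}_\theta\|\hat{\pi}^*)$ for several generators $f$. The probabilities, rewards, the value of $\beta$, and all function names are assumptions chosen for illustration. The $\alpha$-divergence generator follows one common textbook convention and may differ from the parameterization used in the paper.

```python
import numpy as np

def normalize(log_w):
    """Turn unnormalized log-weights into a probability vector."""
    w = np.exp(log_w - np.max(log_w))  # subtract max for numerical stability
    return w / w.sum()

def f_divergence(p, q, f):
    """D_f(p || q) = sum_y q(y) * f(p(y) / q(y)), for strictly positive p and q."""
    return float(np.sum(q * f(p / q)))

# Generator functions (each is convex with f(1) = 0).
def f_reverse_kl(u):
    return u * np.log(u)   # gives KL(p || q)

def f_forward_kl(u):
    return -np.log(u)      # gives KL(q || p)

def f_alpha(u, alpha=0.5):
    # One common alpha-divergence convention (an assumption, not taken from the paper).
    return (u**alpha - alpha * u - (1.0 - alpha)) / (alpha * (alpha - 1.0))

# Toy setup: four candidate responses y for a single prompt x (all values hypothetical).
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])    # reference policy pi_ref(y|x)
pi_theta = np.array([0.1, 0.2, 0.3, 0.4])  # current policy pi_theta(y|x)
reward = np.array([0.0, 0.5, 1.0, 2.0])    # reward r(x, y)
beta = 0.5

# hat_pi_theta(y|x) is proportional to pi_theta^beta * pi_ref^(1 - beta)
hat_pi_theta = normalize(beta * np.log(pi_theta) + (1.0 - beta) * np.log(pi_ref))
# hat_pi_star(y|x) is proportional to pi_ref * exp(r)
hat_pi_star = normalize(np.log(pi_ref) + reward)

divergences = {
    "reverse_kl": f_divergence(hat_pi_theta, hat_pi_star, f_reverse_kl),
    "forward_kl": f_divergence(hat_pi_theta, hat_pi_star, f_forward_kl),
    "alpha_0.5": f_divergence(hat_pi_theta, hat_pi_star, f_alpha),
}
for name, value in divergences.items():
    print(f"{name}: {value:.4f}")
```

Under this convention, the $\alpha$-divergence approaches $\mathrm{KL}(p\|q)$ as $\alpha \to 1$ and $\mathrm{KL}(q\|p)$ as $\alpha \to 0$, so varying $\alpha$ interpolates between the two KL directions. In practice the sum over all responses is intractable; the paper derives objectives for pairwise preference data instead, which this toy example does not attempt to reproduce.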
