Meta-Learning Objectives for Preference Optimization
Carlo Alfano, Silvia Sapora, Jakob Nicolaus Foerster, Patrick Rebeschini, Yee Whye Teh
TL;DR
The paper addresses the problem of efficiently evaluating and designing preference optimization (PO) algorithms for model alignment, and proposes a diagnostic MuJoCo benchmark for studying PO behavior. It introduces Mirror Preference Optimization (MPO), a mirror-descent-based framework that generalizes DPO and ORPO, and uses evolutionary strategies to meta-learn the objective components $g$, $\psi$, and $\phi^{-1}$, yielding PO losses tailored to a given dataset and task. A temporally aware variant, TA-MPO, further improves stability by gradually shifting from imitation to PO; it is evaluated on MuJoCo tasks and then transferred to LLM alignment, with evaluation on benchmarks such as AlpacaEval. The results show that discovered MPO objectives often outperform hand-designed baselines, particularly on noisy or mixed-quality data, and that combining temporal awareness with SFT improves robustness and performance, suggesting broad applicability to offline PO in real-world alignment settings. The work provides a scalable methodology for automatically designing PO losses and highlights the value of continuing policy optimization beyond initial wins.
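For reference, the standard DPO loss, which the summary describes as a special case of MPO, can be sketched as follows. This is a minimal NumPy illustration of the well-known DPO objective only; it does not reproduce the MPO parameterization, and the function name `dpo_loss`, its argument names, and the default `beta` are assumptions made for this example.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    logp_w / logp_l: log-probabilities of the preferred / rejected responses
    under the policy being trained; ref_logp_*: the same quantities under the
    frozen reference policy. All names here are illustrative.
    """
    # Implicit reward margin between preferred and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$; raising the relative log-probability of the preferred response lowers it.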
Abstract
Evaluating preference optimization (PO) algorithms on LLM alignment is challenging: it incurs prohibitive costs, is subject to noise, and depends on several variables such as model size and hyper-parameters. In this work, we show that it is possible to gain insights into the efficacy of PO algorithms on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperforms existing baselines in an LLM alignment task.
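To give a feel for the search procedure mentioned above, here is a minimal sketch of a Gaussian-perturbation evolution strategy. It is not the authors' implementation: the specific ES variant, the hyper-parameters, and the names `evolution_strategy`, `toy_fitness`, and `TARGET` are assumptions for illustration. In the setting of the abstract, the parameter vector would encode a candidate PO objective and the fitness would be the return of a policy trained with it; here a toy quadratic stands in for that expensive inner loop.

```python
import numpy as np

# Hypothetical optimum of the toy fitness; stands in for "the best objective".
TARGET = np.array([1.0, -2.0])

def toy_fitness(theta):
    """Toy stand-in for 'train a policy with this loss and report its return'."""
    return -float(np.sum((theta - TARGET) ** 2))

def evolution_strategy(fitness, theta0, sigma=0.1, lr=0.05,
                       pop=50, iters=200, seed=0):
    """Maximize `fitness` with a simple Gaussian-perturbation ES."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iters):
        # Sample a population of perturbations around the current parameters.
        eps = rng.standard_normal((pop, theta.size))
        scores = np.array([fitness(theta + sigma * e) for e in eps])
        # Standardize scores so the update is invariant to fitness scale.
        adv = (scores - scores.mean()) / (scores.std() + 1e-8)
        # Move toward perturbations that scored above average.
        theta += lr * (eps.T @ adv) / pop
    return theta

if __name__ == "__main__":
    print(evolution_strategy(toy_fitness, np.zeros(2)))
```

Because the method only needs fitness evaluations, it can be applied even when the outer objective (final policy performance) is not differentiable with respect to the loss parameters.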
