Active Learning for Direct Preference Optimization
Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, Tong Yu
TL;DR
This work introduces an active learning framework for Direct Preference Optimization (DPO). It linearizes the DPO objective at the final-layer representation and formulates a D-optimal design to select the most informative preferential feedback. It presents two practical algorithms, ADPO for online feedback and ADPO$^+$ for offline feedback, and proves that under a log-linear policy model the maximum logit error decays as $\tilde{O}(d/\sqrt{n})$ as the amount of feedback $n$ grows. Empirically, the methods improve data efficiency and policy performance on both synthetic log-linear tasks and large language model settings, including real-world preference data. The results advance principled, information-theoretic data selection for aligning LLMs with human preferences, enabling more efficient RLHF and DPO pipelines.
Abstract
Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of selecting the most informative feedback for training them is under-explored. We propose an active learning framework for DPO that can be applied either to collect human feedback online or to choose the most informative subset of already collected feedback offline, and we give efficient algorithms for both settings. The key idea is to linearize the DPO objective at the last layer of the neural network representation of the optimized policy and then compute the D-optimal design to collect preferential feedback. We prove that the errors in our DPO logit estimates diminish with more feedback. We show the effectiveness of our algorithms empirically in the setting that matches our theory and also on large language models.
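To make the D-optimal design step concrete, here is a minimal sketch, not the authors' ADPO or ADPO$^+$ implementation. It assumes a log-linear policy, so the DPO logit for a pair of responses is linear in the difference of their feature vectors, and it greedily selects the pairs whose feature differences most increase the log-determinant of a regularized design matrix. The function name `greedy_d_optimal`, the regularizer `reg`, and the random features are hypothetical choices made for illustration.

```python
import numpy as np


def greedy_d_optimal(Z, budget, reg=1.0):
    """Greedily pick rows of Z that maximize log det(reg * I + sum_i z_i z_i^T).

    Z      : (n, d) array; row i is an assumed feature difference for pair i.
    budget : number of pairs to select for preferential feedback.
    reg    : hypothetical ridge term that keeps the design matrix invertible.
    """
    n, d = Z.shape
    A_inv = np.eye(d) / reg              # inverse of the initial design matrix reg * I
    available = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(min(budget, n)):
        # Adding z raises log det by log(1 + z^T A^{-1} z), so rank pairs by z^T A^{-1} z.
        scores = np.einsum("ij,jk,ik->i", Z, A_inv, Z)
        scores[~available] = -np.inf     # never pick the same pair twice
        i = int(np.argmax(scores))
        chosen.append(i)
        available[i] = False
        # Sherman-Morrison rank-one update of the inverse after adding z z^T.
        Az = A_inv @ Z[i]
        A_inv -= np.outer(Az, Az) / (1.0 + Z[i] @ Az)
    return chosen


# Illustration on synthetic features (random data, assumed for the example only).
rng = np.random.default_rng(0)
phi_a = rng.normal(size=(200, 8))        # features of the first response in each pair
phi_b = rng.normal(size=(200, 8))        # features of the second response in each pair
Z = phi_a - phi_b                        # under a log-linear policy the logit depends on this difference
chosen = greedy_d_optimal(Z, budget=20)
log_det = np.linalg.slogdet(np.eye(8) + Z[chosen].T @ Z[chosen])[1]
print(chosen, log_det)
```

In the online setting the selected pairs would be sent to annotators for feedback; in the offline setting the same scoring could be used to pick a subset of already labeled pairs. For a neural policy, the feature vectors would come from the last-layer representation rather than from random draws as above.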
