Active Preference Optimization for Sample Efficient RLHF
Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray Chowdhury
TL;DR
This work addresses the sample-efficiency challenge in RLHF by reframing alignment as a contextual preference bandit and proving that uniform context sampling can leave a constant sub-optimality gap under small budgets. It introduces Active Preference Optimization (APO), an adaptive strategy that targets the most uncertain contexts and action pairs, and APO-Gen for general function classes, achieving a near-optimal sub-optimality gap up to a logarithmic factor and a non-linearity constant. A lower bound and a nearly matching upper bound establish the fundamental limits and effectiveness of adaptive sampling, while experiments on sentiment and dialogue tasks demonstrate practical gains over random strategies. Together, these results offer a principled, scalable approach to improving LLM alignment with human preferences under budget constraints.
Abstract
Large Language Models (LLMs) aligned using Reinforcement Learning from Human Feedback (RLHF) have shown remarkable generation abilities in numerous tasks. However, collecting high-quality human preferences creates costly bottlenecks in practical deployments, and hence, training data are often budgeted. In these scenarios, it is crucial to collect training data (e.g., contexts, a pair of generations for each context, and a preference indicating which generation is better) carefully, yet most of the existing methods sample contexts uniformly at random from a given collection. Given this, under the Bradley-Terry-Luce preference model and with a small budget of training data, we show that uniform sampling of contexts could lead to a policy (i.e., an aligned model) that suffers a constant sub-optimality gap from the optimal policy. This highlights the need for an adaptive context sampling strategy for effective alignment under a small sample budget. To address this, we reformulate RLHF within the contextual preference bandit framework, treating generations as actions, and give a nearly complete characterization of the sub-optimality gap in terms of both lower and upper bounds. First, when the action set is a $d$-dimensional hypercube and the number of samples is $T$, we show an $\Omega(d/\sqrt{T})$ lower bound. Next, we propose an algorithm, $\textit{Active Preference Optimization}$ ($\texttt{APO}$), that iteratively collects preferences for the most uncertain contexts. We show that the sub-optimality gap of the policy learned via $\texttt{APO}$ matches the lower bound up to a log factor and a non-linearity constant. Finally, we perform experiments on practical datasets to validate $\texttt{APO}$'s efficacy over existing methods, establishing it as a sample-efficient and cost-effective solution for LLM alignment.
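As a rough illustration of what "collecting preferences for the most uncertain contexts" could look like, the following is a minimal sketch and not the paper's implementation. It assumes a linear reward over a hypothetical feature map, so that the Bradley-Terry-Luce probability of preferring one generation over another is a sigmoid of the reward difference, and it measures uncertainty by the quadratic form of the feature difference under the inverse design matrix. The names `btl_prob`, `select_most_uncertain`, and `update_design`, the feature array layout, and the uncertainty score are all assumptions introduced here for illustration.

```python
import numpy as np


def btl_prob(theta, z):
    """BTL probability that the first action beats the second.

    Assumes a linear reward, so z is the feature difference
    phi(x, a) - phi(x, b) and theta is the reward parameter.
    """
    return 1.0 / (1.0 + np.exp(-float(theta @ z)))


def select_most_uncertain(features, V_inv):
    """Pick the (context, action, action) triple with the largest uncertainty.

    features: array of shape (n_contexts, n_actions, d), a hypothetical
              feature map phi(x, a) tabulated for a finite pool.
    V_inv:    inverse of the (d, d) design matrix built from past queries.
    Returns ((context, a, b), score) where score = z^T V_inv z.
    """
    n_ctx, n_act, _ = features.shape
    best, best_score = None, -np.inf
    for x in range(n_ctx):
        for a in range(n_act):
            for b in range(a + 1, n_act):
                z = features[x, a] - features[x, b]
                score = float(z @ V_inv @ z)
                if score > best_score:
                    best, best_score = (x, a, b), score
    return best, best_score


def update_design(V, z):
    """Add the queried feature difference to the design matrix."""
    return V + np.outer(z, z)
```

In a loop over a budget of $T$ queries, one would call `select_most_uncertain`, ask an annotator which of the two generations is better, update the design matrix with `update_design`, and refit the reward parameter on all preferences collected so far. Uniform sampling would replace the selection step with a random draw of the context, which is the baseline the abstract argues against.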
