Active Preference Optimization for Sample Efficient RLHF

Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray Chowdhury

TL;DR

This work addresses the sample-efficiency challenge in RLHF by reframing alignment as a contextual preference bandit and proving that uniform context sampling is suboptimal under small budgets. It introduces Active Preference Optimization (APO), an adaptive strategy that targets the most uncertain contexts and action pairs, and APO-Gen for general function classes, achieving near-optimal suboptimality gaps up to logarithmic factors. Theoretical lower bounds and matching upper bounds establish the fundamental limits and effectiveness of adaptive sampling, while experiments on sentiment and dialogue tasks demonstrate practical gains over random strategies. Together, these results offer a principled, scalable approach to improving LLM alignment with human preferences under budget constraints.

Abstract

Large Language Models (LLMs) aligned using Reinforcement Learning from Human Feedback (RLHF) have shown remarkable generation abilities in numerous tasks. However, collecting high-quality human preferences creates costly bottlenecks in practical deployments, and hence, training data are often budgeted. In these scenarios, it is crucial to collect training data (e.g., contexts, a pair of generations for each context, and a preference indicating which generation is better) carefully, yet most of the existing methods sample contexts uniformly at random from a given collection. Given this, under the Bradley-Terry-Luce preference model and with a small budget of training data, we show that uniform sampling of contexts could lead to a policy (i.e., an aligned model) that suffers a constant sub-optimality gap from the optimal policy. This highlights the need for an adaptive context sampling strategy for effective alignment under a small sample budget. To address this, we reformulate RLHF within the contextual preference bandit framework, treating generations as actions, and give a nearly complete characterization of the sub-optimality gap in terms of both lower and upper bounds. First, when the action set is a $d$-dimensional hypercube and the number of samples is $T$, we show an $\Omega(d/\sqrt{T})$ lower bound. Next, we propose an algorithm, $\textit{Active Preference Optimization}$ ($\texttt{APO}$), that iteratively collects preferences for the most uncertain contexts. We show that the sub-optimality gap of the policy learned via $\texttt{APO}$ matches the lower bound up to a log factor and a non-linearity constant. Finally, we perform experiments on practical datasets to validate $\texttt{APO}$'s efficacy over existing methods, establishing it as a sample-efficient and cost-effective solution for LLM alignment.
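
The idea of collecting preferences for the most uncertain contexts can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear reward in hypothetical features, so each candidate query is summarized by a feature difference z = phi(x, a) - phi(x, a'), and it scores uncertainty with an elliptical norm under a regularized design matrix, which is one common choice and may differ from the paper's exact criterion. The names `btl_prob`, `most_uncertain`, `select_queries`, and `reg` are illustrative.

```python
import numpy as np

def btl_prob(theta, z):
    """Bradley-Terry-Luce probability that the first generation is preferred,
    given z = phi(x, a) - phi(x, a') and an assumed linear reward z @ theta."""
    return 1.0 / (1.0 + np.exp(-(z @ theta)))

def most_uncertain(candidates, V):
    """Index of the candidate z with the largest norm sqrt(z^T V^{-1} z)."""
    V_inv = np.linalg.inv(V)
    scores = np.sqrt(np.einsum("id,de,ie->i", candidates, V_inv, candidates))
    return int(np.argmax(scores))

def select_queries(candidates, budget, reg=1.0):
    """Greedily pick `budget` queries, updating the design matrix each round."""
    d = candidates.shape[1]
    V = reg * np.eye(d)  # regularized design matrix (assumption)
    chosen = []
    for _ in range(budget):
        i = most_uncertain(candidates, V)
        chosen.append(i)
        V += np.outer(candidates[i], candidates[i])
    return chosen

# Toy pool: three copies of a common direction and one rare direction.
pool = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 0.9]])
picks = select_queries(pool, budget=2)
```

In this toy pool, the second pick moves to the rare direction once the common one has been queried, whereas uniform sampling would draw the common direction three times out of four.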


Paper Structure

This paper contains 22 sections, 20 theorems, 71 equations, 2 figures, 2 tables, 3 algorithms.

Key Result

Theorem (Lower bound for uniform context sampling)

There exists a problem instance $(\mathcal{X}, \mathcal{A}, \theta^*)$ for which the policy learnt by a Uniform Learner $\texttt{Alg}$ under a budget $T \ll \lvert \mathcal{X} \rvert$ suffers an $\Omega(1)$ sub-optimality gap with high probability.
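
One plausible way to build intuition for the regime $T \ll \lvert \mathcal{X} \rvert$ is to compute how likely uniform sampling is to miss one particular context entirely. The sketch below is a back-of-the-envelope calculation, not the paper's proof. It assumes contexts are drawn independently and uniformly with replacement, and the function name `miss_probability` is illustrative.

```python
def miss_probability(num_contexts: int, budget: int) -> float:
    """Probability that one fixed context is never drawn in `budget`
    independent uniform draws (with replacement) from `num_contexts` contexts."""
    return (1.0 - 1.0 / num_contexts) ** budget

# Budget much smaller than the number of contexts: a miss is almost certain.
p_small_budget = miss_probability(num_contexts=1000, budget=10)

# Budget much larger than the number of contexts: a miss becomes unlikely.
p_large_budget = miss_probability(num_contexts=1000, budget=10000)
```

When the budget is far below the number of contexts, the miss probability stays close to 1. If the policy's quality hinges on a single context, a uniform learner will therefore usually never observe it.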

Figures (2)

  • Figure 1: Top row (controlled sentiment generation task). Left: evaluation accuracy of the trained reward model vs. number of samples (in %), comparing APO with Random. Middle: sentiment score distribution of aligned policies trained on the reward model learned with APO and on Random's highest-accuracy reward model. Generations from the APO-trained reward model are shifted further toward positive sentiment, showing better alignment than Random. Right: win rates of APO, AE-DPO [mehta2023sample], APL [muldrew2024active], and Random against the SFT policy. The win rates are $72 : 62 : 56 : 54$ in that order, so APO outperforms AE-DPO, APL, and Random. Bottom row (single-turn dialogue task). Left and 2nd left: evaluation accuracy of the trained reward model vs. number of samples, comparing APO with Random, when the number of epochs is 5 (left) and 20 (2nd left). The evaluation accuracy of APO is higher than that of Random in both cases. 2nd right: reward distribution of the APO-aligned, SFT, and Random-aligned policies for generations on prompts in the test dataset. APO's reward distribution indicates better alignment than Random's. Right: win rates of the APO-aligned and Random-aligned policies against the SFT policy. APO outperforms Random, with win rates of $55 : 40$.
  • Figure 2: Visualization of the instance for the lower-bound theorem. Here, $z_g$ represents the feature difference vectors for good contexts, and $z_b$ represents the feature difference vector for the bad context. $\theta^*$ and $\widehat{\theta}$ are the true and learnt parameters, respectively.

Theorems & Definitions (36)

  • Remark
  • Definition: Uniform Learner
  • Theorem: Lower bound for uniform context sampling
  • Theorem: Lower bound for any sampling strategy
  • Proof sketch
  • Theorem: Sub-optimality gap of APO
  • Proof sketch
  • Lemma: Estimation error at round $t$
  • Remark
  • Remark: Extension to Direct Preference Optimization (DPO)
  • ...and 26 more