Sample Efficient Preference Alignment in LLMs via Active Exploration

Viraj Mehta; Syrine Belakaria; Vikramjeet Das; Ojash Neopane; Yijia Dai; Ilija Bogunovic; Barbara Engelhardt; Stefano Ermon; Jeff Schneider; Willie Neiswanger

TL;DR

The paper tackles the high cost of aligning LLMs to user preferences by selecting data sample-efficiently within an Active Contextual Dueling Bandit framework. It introduces AE-Borda, a kernelized method that estimates the contextual Borda function and uses uncertainty to guide exploration, plus online and offline extensions built on Direct Preference Optimization. The authors provide regret guarantees and demonstrate practical gains on both synthetic kernelized tasks and multiple LLM datasets, including two newly contributed datasets, Jeopardy! and Haikus, with improved performance under limited human-feedback budgets and better hallucination avoidance. The work offers theoretical and algorithmic tools for scaling preference alignment in real-world LLM deployments while reducing annotation requirements.

Abstract

Preference-based feedback is important for many applications in machine learning where evaluation of a reward function is not feasible. Notable recent examples arise in preference alignment for large language models, including in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). For many applications of preference alignment, the cost of acquiring human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy, and formalize the setting as an active contextual dueling bandit problem. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a polynomial worst-case regret bound. We extend the setting and methodology for practical use in preference alignment of large language models. We provide two extensions, an online and an offline approach. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets including two new datasets that we contribute to the literature.
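As a rough illustration of the active contextual dueling bandit loop described above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration rather than the authors' implementation: the RBF kernel, the kernel ridge estimate of the Borda function, the rule that picks the context whose best upper bound most exceeds its best lower bound, the uniformly sampled opponent action, the toy preference model `toy_pref_prob`, and all names and hyperparameters (`beta`, `lam`, `ls`) are hypothetical.

```python
import numpy as np

def rbf(A, B, ls=0.3):
    """RBF kernel between rows of A (n, d) and B (m, d). Assumed kernel choice."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def posterior(Z, w, Zq, lam=1.0):
    """Kernel ridge mean and uncertainty of the Borda estimate at query points Zq."""
    if len(Z) == 0:
        # No data yet: prior mean 0.5 (a coin flip) and unit uncertainty.
        return np.full(len(Zq), 0.5), np.ones(len(Zq))
    K = rbf(Z, Z) + lam * np.eye(len(Z))
    Kq = rbf(Zq, Z)
    mean = 0.5 + Kq @ np.linalg.solve(K, w - 0.5)
    var = 1.0 - np.einsum("ij,ji->i", Kq, np.linalg.solve(K, Kq.T))
    return mean, np.sqrt(np.clip(var, 0.0, None))

def toy_pref_prob(x, a, a_other):
    """Hypothetical preference model: logistic in the gap of a toy reward -(a - x)^2."""
    gap = -(a - x) ** 2 + (a_other - x) ** 2
    return 1.0 / (1.0 + np.exp(-gap))

def run_active_exploration(pref_prob, contexts, actions, budget=50, beta=2.0, seed=0):
    """Actively choose (context, action) duels and return a greedy policy estimate."""
    rng = np.random.default_rng(seed)
    contexts, actions = np.asarray(contexts, float), np.asarray(actions, float)
    nx, na = len(contexts), len(actions)
    # Context-major grid of (context, action) pairs, shape (nx * na, 2).
    grid = np.array([(x, a) for x in contexts for a in actions])
    Z, w = np.empty((0, 2)), np.empty(0)
    for _ in range(budget):
        mean, std = posterior(Z, w, grid)
        ucb = (mean + beta * std).reshape(nx, na)
        lcb = (mean - beta * std).reshape(nx, na)
        # Assumed acquisition rule: context where the optimal value is least certain.
        i = int(np.argmax(ucb.max(axis=1) - lcb.max(axis=1)))
        j = int(np.argmax(ucb[i]))      # optimistic action for that context
        k = int(rng.integers(na))       # opponent drawn uniformly at random
        # Binary duel outcome: 1 if the chosen action is preferred, else 0.
        outcome = float(rng.random() < pref_prob(contexts[i], actions[j], actions[k]))
        Z = np.vstack([Z, [contexts[i], actions[j]]])
        w = np.append(w, outcome)
    mean, _ = posterior(Z, w, grid)
    policy = actions[mean.reshape(nx, na).argmax(axis=1)]
    return policy, Z, w

contexts = np.linspace(0.0, 1.0, 5)
actions = np.linspace(0.0, 1.0, 11)
policy, Z, w = run_active_exploration(toy_pref_prob, contexts, actions, budget=40)
```

Regressing binary duel outcomes against a uniformly random opponent is one simple way to target a Borda-style score, since the expected outcome at a pair (context, action) is that action's average win probability over opponents. A uniform-sampling baseline can be obtained under the same assumptions by drawing `i` at random instead of maximizing the uncertainty gap.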

Paper Structure (42 sections, 3 theorems, 25 equations, 12 figures, 1 table, 3 algorithms)

Introduction
Related Work
Learning from Comparative Feedback
Dueling Bandits
Active Contextual Bandit Optimization
Problem Setting
Active Exploration in the Kernelized Setting
The Contextual Borda Function
Methods
Estimating the Contextual Borda Function
Selecting Contexts and Actions
Theoretical Analysis
Proof Overview.
Concrete Performance Bounds.
Scaling Active Exploration to Large Language Models
...and 27 more sections

Key Result

Theorem 4.3

Suppose we run Algorithm \ref{alg:Borda-AE} with … Then, with probability at least $1 - \delta$, we have that …

Figures (12)

Figure 1: Illustration of the active contextual dueling bandit setting, and its application to sample-efficient preference alignment in large language models.
Figure 2: Performance of all methods across 10 random functions $r$ with 1D contexts and 1D actions. The left plot shows the median regret across contexts and the right shows the maximum. Error bands show one standard error.
Figure 3: Visualizing the Borda function estimate with 100 data points. Left: the ground truth contextual Borda function $f_r$ (red line is the optimal policy). Right: the mean of our posterior estimate of $f_r$ (red line is the best policy estimate). Red dots are queries where $w_t = 0$ and green dots are queries where $w_t = 1$. For a full version, see Fig. \ref{a:kocdb_addtl_experiments}.
Figure 4: Active Exploration for DPO in LLMs. For multiple models and datasets we compare AE-Borda-DPO vs Uniform-DPO. In the first six plots we show the average win rate over supervised fine-tuning (SFT), and in the final two plots (on the Jeopardy! dataset) we show the Null Rate given an incorrect answer (I.A.).
Figure 5: Progress of AE-Borda across 50, 150, and 600 datapoints. From the top downwards, the charts show the ground truth function, the mean of the posterior estimate of $f_r$, the uncertainty function, the estimate of the value function as well as the acquisition function given in Eq. (\ref{eq:context_selection}), and the regret over time.
...and 7 more figures

Theorems & Definitions (5)

Theorem 4.3
Lemma B.1
proof
Lemma B.2
proof
