Sample-Efficient Alignment for LLMs

Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin

TL;DR

We introduce a unified algorithm based on Thompson sampling. Its practical implementation, SEA, achieves highly sample-efficient alignment with the oracle's preferences and outperforms recent active exploration methods for LLMs.

Abstract

We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem within the framework of contextual dueling bandits. This formulation, which subsumes recent paradigms such as online RLHF and online DPO, inherently calls for sample-efficient algorithms that incorporate online active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its applications in two distinct LLM alignment scenarios. The practical agent that efficiently implements this algorithm, named SEA (Sample-Efficient Alignment), is empirically validated through extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with the oracle's preferences, outperforming recent active exploration methods for LLMs. Additionally, we release the implementation of SEA together with an efficient codebase designed for online alignment of LLMs, aiming to accelerate future research in this field.
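To make the idea of Thompson sampling in a dueling setting concrete, here is a minimal toy sketch. It is not the SEA implementation: the linear reward model, the small ensemble used as a stand-in for a posterior over rewards, the Bradley-Terry style oracle, and every function name (`select_duel`, `oracle_prefers_first`, `update_ensemble`, `run_toy`) are illustrative assumptions that do not come from the paper.

```python
import numpy as np


def select_duel(features, ensemble, rng):
    """Pick two candidates, each the argmax under a different sampled reward model.

    features: (num_candidates, dim) array, one row per candidate response.
    ensemble: (num_members, dim) array; each row is one hypothetical linear
              reward model, standing in for a posterior sample (assumption).
    """
    i, j = rng.choice(len(ensemble), size=2, replace=False)
    first = int(np.argmax(features @ ensemble[i]))
    second = int(np.argmax(features @ ensemble[j]))
    return first, second


def oracle_prefers_first(true_theta, x_first, x_second, rng):
    """Simulated preference oracle (assumed Bradley-Terry form)."""
    margin = (x_first - x_second) @ true_theta
    return rng.random() < 1.0 / (1.0 + np.exp(-margin))


def update_ensemble(ensemble, x_win, x_lose, lr=0.5):
    """One logistic-loss gradient step per member on a single comparison."""
    diff = x_win - x_lose
    p_win = 1.0 / (1.0 + np.exp(-(ensemble @ diff)))  # shape (num_members,)
    ensemble += lr * (1.0 - p_win)[:, None] * diff[None, :]
    return ensemble


def run_toy(num_rounds=200, num_candidates=8, dim=4, num_members=5, seed=0):
    """Run the toy interaction loop and count how many oracle queries it uses."""
    rng = np.random.default_rng(seed)
    true_theta = rng.normal(size=dim)                # hidden oracle reward
    ensemble = rng.normal(size=(num_members, dim))   # diverse initial members
    num_queries = 0
    for _ in range(num_rounds):
        features = rng.normal(size=(num_candidates, dim))  # new context
        a, b = select_duel(features, ensemble, rng)
        if a == b:
            continue  # both sampled models agree, so no duel is queried
        num_queries += 1
        if oracle_prefers_first(true_theta, features[a], features[b], rng):
            update_ensemble(ensemble, features[a], features[b])
        else:
            update_ensemble(ensemble, features[b], features[a])
    return true_theta, ensemble, num_queries
```

Calling `run_toy()` returns the hidden reward vector, the trained ensemble, and the number of oracle queries spent. In this sketch, disagreement between sampled ensemble members is what drives exploration: a duel is only sent to the oracle when two sampled models pick different candidates.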

Paper Structure

This paper contains 25 sections, 10 equations, 10 figures, 2 tables, 2 algorithms.

Figures (10)

  • Figure 1: Win rate comparison of model responses against reference responses on the TL;DR task, judged by the preference oracle. All compared methods use the same optimization method (DPO). (Left) Performance improvements at convergence over SFT models achieved by offline (Offline DPO), passively online (Online DPO), and our active exploration (SEA DPO) methods. (Right) The number of queries required by the passively online method (Passive) versus that by different active exploration methods to attain various levels of win rates. SEA achieves the best sample efficiency for online alignment compared to XPO and APL.
  • Figure 2: Illustrative comparison between CDB and LLM alignment.
  • Figure 3: Different paradigms for solving the LLM alignment problem in the CDB framework. Note that although all paradigms follow the LLM alignment interface (\ref{fig:interface}) with the interaction loop, some are actually offline or iteratively online (i.e., they loop only once or a few times). Detailed comparisons are made in \ref{sec:existing_work}. We use colors to denote learnable components, RL optimizer, direct optimizer, and active exploration. $r_\phi$ denotes a point estimate of the human's implicit reward, while ${\mathcal{R}}_\Phi$ refers to an uncertainty-aware reward model.
  • Figure 4: The learning system for experimenting with online LLM alignment algorithms.
  • Figure 5: Win rate comparison of different algorithms against their initial SFT models across three scales and three direct optimizers.
  • ...and 5 more figures