Active Preference Learning for Large Language Models

William Muldrew, Peter Hayes, Mingtian Zhang, David Barber

TL;DR

This work tackles data-efficiency in aligning large language models by marrying Direct Preference Optimization (DPO) with an Active Preference Learning loop that selectively acquires human or AI preferences. It introduces acquisition functions based on predictive entropy and implicit preference certainty to guide which prompt/completion pairs are labeled, using GPT-4 as the oracle. Empirically, the approach yields faster learning and improved final performance on IMDB and TLDR tasks, with the certainty-based strategy providing consistent gains and analysis showing it surfaces informative, confidently incorrect examples. The study highlights practical directions for scalable, feedback-efficient LLM alignment and points to online and efficiency-enhancing extensions.

Abstract

As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
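The two quantities named in the abstract can be illustrated with a small sketch. The formulas are assumptions, since this page does not spell them out: predictive entropy is approximated by a Monte Carlo average of negative sequence log-probabilities, and preference certainty is taken to be the absolute difference of the DPO implicit rewards, with the most certain pairs sent to the oracle. All function names, dictionary keys, and numbers below are hypothetical.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def preference_probability(r1, r2):
    """Bradley-Terry probability that completion 1 is preferred over completion 2."""
    z = r1 - r2
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative margins
    return e / (1.0 + e)


def preference_certainty(r1, r2):
    """Assumed certainty measure: absolute margin between implicit rewards."""
    return abs(r1 - r2)


def predictive_entropy_estimate(sample_logps):
    """Assumed Monte Carlo entropy estimate: minus the mean log-probability
    of completions sampled from the model for one prompt."""
    return -sum(sample_logps) / len(sample_logps)


def select_for_labelling(candidates, batch_size, beta=0.1):
    """Return the ids of the batch_size pairs with the largest certainty."""
    scored = []
    for c in candidates:
        r1 = implicit_reward(c["logp_policy_1"], c["logp_ref_1"], beta)
        r2 = implicit_reward(c["logp_policy_2"], c["logp_ref_2"], beta)
        scored.append((preference_certainty(r1, r2), c["id"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair_id for _, pair_id in scored[:batch_size]]


# Toy sequence log-probabilities (made up for illustration, not from the paper).
candidates = [
    {"id": "a", "logp_policy_1": -10.0, "logp_ref_1": -12.0,
     "logp_policy_2": -11.0, "logp_ref_2": -11.0},
    {"id": "b", "logp_policy_1": -20.0, "logp_ref_1": -20.0,
     "logp_policy_2": -15.0, "logp_ref_2": -25.0},
    {"id": "c", "logp_policy_1": -5.0, "logp_ref_1": -5.5,
     "logp_policy_2": -6.0, "logp_ref_2": -6.5},
]
selected = select_for_labelling(candidates, batch_size=2)
```

In this sketch the selected pairs are the ones the implicit preference model is most confident about. If the oracle disagrees with the model on such a pair, that label is a confidently wrong example of the kind the Figure 3 caption mentions.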

Paper Structure

This paper contains 26 sections, 9 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Average self-consistency of preference labels provided by GPT-3 and GPT-4 across 50 prompt/completion pairs. Each model provided two preference labels for each prompt/completion pair.
  • Figure 2: Win-rate at evaluation waypoints. (a) IMDB: win-rate vs the initial model. (b) TLDR: win-rate vs human-provided summaries on the test prompts. The x-axis is the size of the acquired dataset used for fine-tuning at the point of evaluation. Each model and dataset pair was trained with 9 random seeds, and we plot means with standard errors. Preference certainty and entropy + preference certainty outperform the random baseline.
  • Figure 3: Histograms of probabilities from our implicit Bradley-Terry preference model across a batch of acquired data, grouped by incorrect (red) and correct (green) preferences according to the oracle. This assumes a decision threshold of 0.5. Our preference certainty acquisition function surfaces confidently wrong examples.
  • Figure 4: GPT-4 oracle prompts for sentiment and summarization tasks.
  • Figure 5: Win-rate vs initial model after each acquired batch for IMDB with random and preference certainty acquisition and online fine-tuning. Only a single fine-tuning gradient step is taken on the latest batch.
  • ...and 1 more figure
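
The Figure 2 caption reports means with standard errors over 9 random seeds. A minimal sketch of that summary statistic follows, assuming the standard error is the sample standard deviation divided by the square root of the number of seeds. The win-rates are placeholder values, not results from the paper, and the function name is hypothetical.

```python
import math


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = sum(values) / n
    # Unbiased sample variance (divides by n - 1).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Placeholder win-rates for 9 seeds at one evaluation waypoint (made up).
win_rates = [0.58, 0.61, 0.63, 0.60, 0.59, 0.62, 0.64, 0.57, 0.61]
mean, stderr = mean_and_standard_error(win_rates)
```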