ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Davit Melikidze; Marian Schneider; Jessica Lam; Martin Wertich; Ido Hakimi; Barna Pásztor; Andreas Krause

TL;DR

This work introduces ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation, and demonstrates that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.
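The idea of prioritizing response pairs with large predicted quality gaps can be illustrated with a small sketch. This is not the paper's definition of DELTAUCB or DRTS. It is a generic UCB-style rule, written under the assumption that the reward model returns a mean and a standard deviation per response. The function name `select_large_gap_pair` and the exploration weight `beta` are hypothetical.

```python
from itertools import permutations


def select_large_gap_pair(means, stds, beta=1.0):
    """Return (chosen_idx, rejected_idx) maximizing an optimistic quality gap.

    Illustrative assumption: the gap of an ordered pair (i, j) is scored as the
    upper confidence bound of response i minus the lower confidence bound of
    response j, so both a high predicted gap and high uncertainty raise the score.
    """
    if len(means) != len(stds) or len(means) < 2:
        raise ValueError("need at least two responses with matching stds")

    def optimistic_gap(pair):
        i, j = pair
        upper_i = means[i] + beta * stds[i]  # optimistic estimate for the candidate winner
        lower_j = means[j] - beta * stds[j]  # pessimistic estimate for the candidate loser
        return upper_i - lower_j

    # Enumerate all ordered pairs of distinct responses and keep the best one.
    return max(permutations(range(len(means)), 2), key=optimistic_gap)
```

With `beta=0.0` the rule reduces to pairing the highest and lowest predicted rewards; larger values of `beta` shift the selection toward responses whose rewards are still uncertain.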

Paper Structure (56 sections, 8 equations, 9 figures, 23 tables, 5 algorithms)

Introduction
Related Work
Background
The ActiveUltraFeedback Pipeline
Response Generation
Reward Prediction
Response Pair Selection
Baseline Heuristics
Dueling Bandit Methods
Active Delta Learning Methods
Preference Annotation
Reward Model Training
Evaluation
Implementation Details
Datasets
...and 41 more sections

Figures (9)

Figure 1: Comparison of response pair selection methods on downstream and reward model benchmarks deployed in ActiveUltraFeedback. The scores have been averaged over four datasets of different scales (see the input prompt dataset ablation) and indicate improvement over the base model. * denotes an existing dueling bandit method and † indicates our novel active delta learning methods.
Figure 2: The ActiveUltraFeedback pipeline. For each prompt, responses are generated from a large pool of LLMs, the rewards for the responses are predicted with corresponding uncertainties, and a pair of responses is selected for preference annotation. Each new batch of preference data is used to train the reward model, improving the accuracy of reward and uncertainty estimates for subsequent iterations. The displayed procedure is performed in a looping manner until all prompts have been processed.
Figure 3: Mean performance trajectories for fine-tuned and reward models as a function of consumed samples on UltraFeedback prompts. We compare datasets generated via ActiveUltraFeedback using various response pair selection methods. We provide the scores achieved using the UltraFeedback dataset (Cui et al., 2024) with the original response pairs.
Figure 4: Benchmarking of downstream and reward model performance across input prompt datasets, increasing in scale from left to right. Scores are reported as relative deltas to the base model. For reference, we also provide the scores achieved using each original preference dataset, rather than only its prompts with ActiveUltraFeedback.
Figure 5: Mean performance trajectories of models fine-tuned using IPO and SimPO as a function of consumed samples on datasets generated using ActiveUltraFeedback based on UltraFeedback prompts. For reference, we also provide the scores achieved using the original preference dataset, rather than only its prompts with ActiveUltraFeedback.
...and 4 more figures
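
The loop described in the Figure 2 caption can be sketched as follows. This is a minimal outline under stated assumptions, not the released implementation: `run_pipeline` and all of its callable arguments (`generators`, `predict`, `select_pair`, `annotate`, `update`) are hypothetical placeholders for the response pool, the reward model with uncertainty estimates, the response pair selection method, the preference annotator, and the reward model training step.

```python
def run_pipeline(prompts, generators, predict, select_pair, annotate, update,
                 batch_size=2):
    """Build a preference dataset by looping over prompts in batches.

    All callables are hypothetical stand-ins:
      generators  - list of functions mapping a prompt to one response each
      predict     - (prompt, responses) -> (means, stds) reward estimates
      select_pair - (means, stds) -> (i, j) indices of the pair to annotate
      annotate    - (prompt, a, b) -> (chosen, rejected)
      update      - trains the reward model on a batch of new preferences
    """
    dataset, batch = [], []
    for prompt in prompts:
        # 1. Generate one response per model in the pool.
        responses = [generate(prompt) for generate in generators]
        # 2. Predict rewards together with their uncertainties.
        means, stds = predict(prompt, responses)
        # 3. Select a pair of responses for preference annotation.
        i, j = select_pair(means, stds)
        chosen, rejected = annotate(prompt, responses[i], responses[j])
        batch.append((prompt, chosen, rejected))
        # 4. Each full batch updates the reward model before the next iteration.
        if len(batch) == batch_size:
            update(batch)
            dataset.extend(batch)
            batch = []
    # Flush the final, possibly smaller, batch.
    if batch:
        update(batch)
        dataset.extend(batch)
    return dataset
```

Passing the components as callables mirrors the modularity the pipeline is described as having: a different response pair selection method can be swapped in by changing only `select_pair`.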
