Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning

Sen Yang; Leyang Cui; Deng Cai; Xinting Huang; Shuming Shi; Wai Lam

TL;DR

This work proposes a comparative view that ranks the implicit reward margins predicted by DPO in order to select the response pairs most worth annotating. It shows that annotating response pairs with small margins is generally better than annotating pairs with large margins or randomly selected pairs, in both single- and multi-iteration scenarios.

Abstract

Iterative preference learning, though yielding superior performance, requires online annotated preference labels. In this work, we study strategies for selecting response pairs that are worth annotating, so that annotation is cost-efficient while performance remains competitive with, or even better than, the random selection baseline for iterative preference learning. Building on assumptions regarding uncertainty and distribution shifts, we propose a comparative view that ranks the implicit reward margins predicted by DPO to select the response pairs that yield more benefit. Through extensive experiments, we show that annotating response pairs with small margins is generally better than annotating pairs with large margins or randomly selected pairs, under both single- and multi-iteration scenarios. In addition, our empirical results suggest allocating more of the annotation budget to earlier iterations rather than later ones when training across multiple iterations.
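The selection idea in the abstract can be illustrated with a minimal sketch. It assumes the standard DPO implicit reward, $\beta\,(\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x))$, and takes the margin of a pair to be the difference between the implicit rewards of its two responses. The function names, the default `beta = 0.1`, and the use of the absolute margin (since unlabeled pairs have no preferred order) are assumptions made for illustration, not details confirmed by this page.

```python
from typing import List, Sequence, Tuple


def implicit_reward_margin(
    policy_logps: Tuple[float, float],
    ref_logps: Tuple[float, float],
    beta: float = 0.1,  # assumed value; the paper's setting is not given here
) -> float:
    """Implicit reward margin between two responses to the same prompt.

    Each tuple holds the summed token log-probabilities of response A and
    response B under the policy and the reference model, respectively.
    """
    reward_a = beta * (policy_logps[0] - ref_logps[0])
    reward_b = beta * (policy_logps[1] - ref_logps[1])
    return reward_a - reward_b


def select_smallest_margin(margins: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` pairs with the smallest |margin|.

    Assumption: pairs are unlabeled, so only the magnitude of the margin
    is used when ranking.
    """
    order = sorted(range(len(margins)), key=lambda i: abs(margins[i]))
    return sorted(order[:budget])


# Hypothetical margins for four response pairs, with a budget of two.
example_margins = [0.2, -0.05, 1.3, 0.01]
chosen = select_smallest_margin(example_margins, budget=2)
```

Here `chosen` holds the indices of the two pairs on which the current policy is least decided, and these would be sent for annotation first. A random baseline would instead sample indices uniformly, and a largest-margin variant would reverse the sort order.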

Paper Structure (45 sections, 5 equations, 5 figures, 3 tables)

Introduction
Preliminaries
Direct Preference Learning
Online Iterative DPO
Step-$\bm{i}$
Margin-based Selection within One Iteration
Uncertainty
Distribution Shift
Strategy Variants
Instance-level
Corpus-level
Margin Normalization
Experimental Setup
Synthetic Oracle
Training
...and 30 more sections

Figures (5)

Figure 1: The workflow of online iterative preference learning, in which we apply two levels of selection before annotation.
Figure 2: Results on AlpacaEval-2.0 with different instance-level strategies and different training set sizes.
Figure 3: Results on AlpacaEval-2.0 with different corpus-level strategies and different training set sizes.
Figure 4: Some statistics about the selected subset using different strategies combined. The collected subsets are used to train $\pi_\theta^1$ under the single-iteration setting.
Figure 5: Multi-iteration results on AlpacaEval-2.0 with always-random and always-smallest strategies, respectively, across three follow-up iterations, with 5k instances (originally 10k instructions) per iteration. A single-iter baseline, which is trained by using all the instructions with the always-smallest strategy within a single round, is also included for comparison.
