Reinforcement Learning from Human Feedback with Active Queries

Kaixuan Ji; Jiafan He; Quanquan Gu

TL;DR

Addressing the data-efficiency challenge of RLHF, this paper reframes alignment as a contextual dueling bandit problem and introduces APPO, a query-efficient method with instance-dependent regret guarantees. It further provides ADPO, a practical DPO-based approach that uses pseudo-labels and uncertainty-based querying to substantially reduce the number of human labels while maintaining strong performance. Theoretical results establish $\widetilde{O}(d^2/\Delta)$ regret and $\widetilde{O}(d^2/\Delta^2)$ query complexity for APPO, and experiments on Zephyr models show ADPO achieving comparable or superior results with about half the queries. Together, these contributions advance scalable, human-preference-aligned LLMs by reducing labeling costs without sacrificing alignment quality.

Abstract

Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\widetilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\widetilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many queries for human preference, matches the performance of the state-of-the-art DPO method.
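The idea of combining pseudo-labels with uncertainty-based querying can be illustrated with a minimal sketch. It assumes (this is not confirmed by the text above) that confidence is measured from the DPO implicit reward margin as $|2\sigma(\text{margin}) - 1|$ and compared with a threshold `gamma`. The function names, the threshold, and the `ask_human` callback are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def implicit_margin(logp_a: float, ref_logp_a: float,
                    logp_b: float, ref_logp_b: float, beta: float) -> float:
    """DPO-style implicit reward margin between responses a and b.

    Each implicit reward is beta * (policy log-prob - reference log-prob).
    """
    return beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))


def label_or_query(margin: float, gamma: float,
                   ask_human: Callable[[], int]) -> Tuple[int, bool]:
    """Return (label, queried); label 1 means response a is preferred.

    Assumption: when the model's own preference is confident enough,
    its prediction is used as a pseudo-label and no human is queried.
    """
    p_a = sigmoid(margin)               # model's probability that a beats b
    confidence = abs(2.0 * p_a - 1.0)   # 0 = undecided, 1 = certain
    if confidence > gamma:
        return (1 if p_a > 0.5 else 0), False   # pseudo-label, no query
    return ask_human(), True                     # uncertain: spend a query


# Usage: a confident pair is pseudo-labelled, an uncertain pair is queried.
confident = implicit_margin(-1.0, -2.0, -3.0, -2.0, beta=1.0)   # margin 2.0
print(label_or_query(confident, gamma=0.5, ask_human=lambda: 0))
print(label_or_query(0.1, gamma=0.5, ask_human=lambda: 0))
```

Under these assumptions, raising `gamma` sends more pairs to human annotators, while lowering it relies more heavily on pseudo-labels.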

Paper Structure (46 sections, 13 theorems, 66 equations, 3 figures, 5 tables, 2 algorithms)

This paper contains 46 sections, 13 theorems, 66 equations, 3 figures, 5 tables, 2 algorithms.

Introduction
Notation
Related Work
Reinforcement Learning from Human Feedback
Dueling Bandits
Active Learning
Preliminaries
Algorithm
Regularized MLE Estimator
Uncertainty-Aware Query Criterion
Proximal Policy Optimization
Theoretical Analysis
Practical Algorithm
Direct Preference Optimization
Confidence Estimator
...and 31 more sections

Key Result

Theorem 5.1

Let $\Delta$ be the minimal sub-optimality gap (Definition 3.3) under the gap assumption. If we set the parameters $\Gamma= \widetilde{O}(\Delta/\sqrt{d})$, $\lambda=B^{-2}$, $\eta=\widetilde{O}(\sqrt{\Gamma^2 \log \mathcal{A} /d})$, and $\beta=\widetilde{O}(\sqrt{d}/\kappa_{\sigma})$ in the APPO algorithm, then the regret is upper bounded by $\widetilde{O}(d^2/\Delta)$. In addition, the query complexity of the algorithm is upper bounded by $\widetilde{O}(d^2/\Delta^2)$.
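To make the roles of the threshold $\Gamma$ and the regularizer $\lambda$ more concrete, the following is a minimal sketch of an uncertainty-aware query rule for a linear dueling bandit. It assumes (not stated in the text above) that uncertainty is the elliptical norm $\|z\|_{\Sigma^{-1}}$ of the feature difference $z$ between the two candidate actions, that $\Sigma$ starts at $\lambda I$ and accumulates $z z^\top$ for each queried pair, and that a query is made only when the uncertainty exceeds $\Gamma$. The class and method names are hypothetical.

```python
import math
from typing import List


class UncertaintyQueryRule:
    """Tracks Sigma^{-1} and decides whether a preference query is needed."""

    def __init__(self, dim: int, lam: float, gamma: float) -> None:
        # Sigma = lam * I initially, so Sigma^{-1} = (1 / lam) * I.
        self.inv = [[(1.0 / lam if i == j else 0.0) for j in range(dim)]
                    for i in range(dim)]
        self.gamma = gamma

    def _matvec(self, z: List[float]) -> List[float]:
        return [sum(row[j] * z[j] for j in range(len(z))) for row in self.inv]

    def uncertainty(self, z: List[float]) -> float:
        """Elliptical norm sqrt(z^T Sigma^{-1} z) of the feature difference."""
        v = self._matvec(z)
        return math.sqrt(sum(zi * vi for zi, vi in zip(z, v)))

    def should_query(self, z: List[float]) -> bool:
        # Assumption: query only when uncertainty exceeds the threshold.
        return self.uncertainty(z) > self.gamma

    def update(self, z: List[float]) -> None:
        """Rank-one update Sigma += z z^T via the Sherman-Morrison formula."""
        v = self._matvec(z)  # Sigma^{-1} z (Sigma^{-1} is symmetric)
        denom = 1.0 + sum(zi * vi for zi, vi in zip(z, v))
        n = len(z)
        for i in range(n):
            for j in range(n):
                self.inv[i][j] -= v[i] * v[j] / denom


# Usage: repeated queries along one direction shrink its uncertainty,
# while an unexplored direction still triggers a query.
rule = UncertaintyQueryRule(dim=2, lam=1.0, gamma=0.6)
z = [1.0, 0.0]
for _ in range(2):
    if rule.should_query(z):
        rule.update(z)
print(rule.should_query(z), rule.should_query([0.0, 1.0]))
```

In this sketch a smaller `gamma` (mirroring a smaller gap $\Delta$) keeps the rule querying for longer, which is consistent with the query complexity growing as the gap shrinks.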

Figures (3)

Figure 1: Test accuracy curves of DPO and ADPO starting from Zephyr-Beta-SFT. The x-axis is the number of queries and the y-axis is the metric for the corresponding dataset. Compared to DPO, ADPO improves faster and reaches a higher performance upper bound.
Figure 2: Test accuracy curves of DPO and ADPO starting from Zephyr-Gemma-SFT. The x-axis is the number of queries and the y-axis is the metric for the corresponding dataset. Compared to DPO, ADPO improves faster and reaches a higher performance upper bound.
Figure 3: Test accuracy curves of DPO, ADPO (w/o PL), and ADPO under LoRA fine-tuning. The x-axis is the number of queries and the y-axis is the metric for the corresponding dataset. Compared to DPO and ADPO (w/o PL), ADPO improves faster and reaches a higher performance upper bound.

Theorems & Definitions (17)

Definition 3.3: Minimal sub-optimality gap
Remark 3.5
Theorem 5.1
Remark 5.2
Remark 5.3
Lemma E.1: Modified from Lemma 4.5, zhang2023interplay
Lemma E.2
Lemma E.3
Lemma E.4
Lemma E.5: Modified from Lemma 6.2, he2022near
...and 7 more
