Reinforcement Learning from Human Feedback with Active Queries

Kaixuan Ji, Jiafan He, Quanquan Gu

TL;DR

Addressing the data-efficiency challenge of RLHF, this paper reframes alignment as a contextual dueling bandit problem and introduces APPO, a query-efficient method with instance-dependent regret guarantees. It further provides ADPO, a practical DPO-based approach that uses pseudo-labels and uncertainty-based querying to reduce the number of human labels while maintaining strong performance. Theoretical results establish $\widetilde{O}(d^2/\Delta)$ regret and $\widetilde{O}(d^2/\Delta^2)$ query complexity for APPO, and experiments on Zephyr models show ADPO achieving comparable or superior results with about half the queries. Together, these contributions make human-preference alignment of LLMs more scalable by cutting labeling costs without sacrificing alignment quality.
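The uncertainty-based querying and pseudo-labeling idea mentioned above can be illustrated with a minimal sketch. It assumes the model's confidence on a response pair is measured by the absolute DPO implicit reward margin and compared against a threshold; the names `dpo_margin`, `label_or_query`, `gamma`, and `ask_human` are hypothetical and are not taken from the paper.

```python
def dpo_margin(logp_a, logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """Implicit DPO reward margin between responses a and b.

    Inputs are sequence log-probabilities under the current policy and
    under the reference policy (hypothetical helper for illustration).
    """
    return beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))


def label_or_query(margin, gamma, ask_human):
    """Return (label, queried), where label is 1 if response a is preferred.

    Assumption: a large |margin| means the model is confident, so its own
    prediction is used as a pseudo-label; otherwise a human is queried.
    """
    if abs(margin) >= gamma:
        # Confident: keep the model's prediction, spend no human query.
        return (1 if margin > 0 else 0), False
    # Uncertain: ask the human annotator (ask_human returns 0 or 1).
    return ask_human(), True


if __name__ == "__main__":
    m = dpo_margin(-1.0, -3.0, -2.0, -2.0, beta=0.5)
    print(m, label_or_query(m, gamma=0.5, ask_human=lambda: 0))
```

Under this rule, only pairs with a small margin consume a human query, which is one plausible way a method could roughly halve the number of queries.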

Abstract

Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\widetilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\widetilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many queries for human preference, matches the performance of the state-of-the-art DPO method.
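To make the contextual dueling bandit framing concrete, here is a small sketch under stated assumptions: rewards are linear in a $d$-dimensional feature vector, preferences follow a logistic (Bradley-Terry style) link, and the gap is the smallest difference between the best action and any strictly worse action across contexts. The function names and the toy data are illustrative only.

```python
import math


def reward(theta, phi):
    """Linear reward <theta, phi(x, a)> (assumed reward model)."""
    return sum(t * p for t, p in zip(theta, phi))


def pref_prob(theta, phi_a, phi_b):
    """Probability that action a is preferred to b under a logistic link."""
    diff = reward(theta, phi_a) - reward(theta, phi_b)
    return 1.0 / (1.0 + math.exp(-diff))


def min_gap(theta, features_by_context):
    """Smallest sub-optimality gap over all contexts and sub-optimal actions."""
    gaps = []
    for feats in features_by_context:
        rewards = [reward(theta, f) for f in feats]
        best = max(rewards)
        # Only strictly worse actions contribute a gap.
        gaps.extend(best - r for r in rewards if r < best)
    return min(gaps) if gaps else 0.0


if __name__ == "__main__":
    theta = [1.0, 0.0]
    contexts = [
        [[1.0, 0.0], [0.5, 0.0]],   # two candidate responses for context 1
        [[0.25, 1.0], [0.0, 0.0]],  # two candidate responses for context 2
    ]
    print(min_gap(theta, contexts))
    print(pref_prob(theta, [1.0, 0.0], [0.5, 0.0]))
```

A larger gap means the best response is easier to tell apart from the rest, which is why the stated bounds improve as $\Delta$ grows.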

Paper Structure (46 sections, 13 theorems, 66 equations, 3 figures, 5 tables, 2 algorithms)

Key Result

Theorem 5.1

Let $\Delta$ be the minimal sub-optimality gap (Definition 3.3) from the gap assumption. If we set the parameters $\Gamma = \widetilde{O}(\Delta/\sqrt{d})$, $\lambda = B^{-2}$, $\eta = \widetilde{O}(\sqrt{\Gamma^2 \log \mathcal{A}/d})$, and $\beta = \widetilde{O}(\sqrt{d}/\kappa_{\sigma})$ in APPO, then, with high probability, the regret of APPO is upper bounded by $\widetilde{O}(d^2/\Delta)$. In addition, the query complexity of APPO is upper bounded by $\widetilde{O}(d^2/\Delta^2)$.
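The abstract states the bounds as $\widetilde{O}(d^2/\Delta)$ for regret and $\widetilde{O}(d^2/\Delta^2)$ for query complexity. The short sketch below shows only how these leading terms scale with $d$ and $\Delta$; it ignores constants and logarithmic factors, and the function names and sample values are hypothetical.

```python
def regret_scale(d, gap):
    """Leading term d^2 / gap of the regret bound (constants and logs dropped)."""
    return d ** 2 / gap


def query_scale(d, gap):
    """Leading term d^2 / gap^2 of the query bound (constants and logs dropped)."""
    return d ** 2 / gap ** 2


if __name__ == "__main__":
    for gap in (0.5, 0.25):
        print(gap, regret_scale(4, gap), query_scale(4, gap))
```

Halving the gap doubles the regret term but quadruples the query term, so the number of human queries is the more gap-sensitive of the two quantities. Neither term depends on the number of rounds, which is what makes the guarantee instance-dependent.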

Figures (3)

  • Figure 1: Test accuracy curves of DPO and ADPO starting from Zephyr-Beta-SFT. The x-axis is the number of queries and the y-axis is the metric for the corresponding dataset. Compared to DPO, ADPO improves faster and reaches a higher performance ceiling.
  • Figure 2: Test accuracy curves of DPO and ADPO starting from Zephyr-Gemma-SFT. The x-axis is the number of queries and the y-axis is the metric for the corresponding dataset. Compared to DPO, ADPO improves faster and reaches a higher performance ceiling.
  • Figure 3: Test accuracy curves of DPO, ADPO (w/o PL), and ADPO under LoRA fine-tuning. The x-axis is the number of queries and the y-axis is the metric for the corresponding dataset. Compared to DPO and ADPO (w/o PL), ADPO improves faster and reaches a higher performance ceiling.

Theorems & Definitions (17)

  • Definition 3.3: Minimal sub-optimality gap
  • Remark 3.5
  • Theorem 5.1
  • Remark 5.2
  • Remark 5.3
  • Lemma E.1: Modified from Lemma 4.5, zhang2023interplay
  • Lemma E.2
  • Lemma E.3
  • Lemma E.4
  • Lemma E.5: Modified from Lemma 6.2, he2022near
  • ...and 7 more