Efficient Exploration at Scale

Seyed Mohammad Asghari, Chris Chute, Vikranth Dwaracherla, Xiuyuan Lu, Mehdi Jafarnia, Victor Minden, Zheng Wen, Benjamin Van Roy

Abstract

We develop an online learning algorithm that dramatically improves the data efficiency of reinforcement learning from human feedback (RLHF). Our algorithm incrementally updates reward and language models as choice data is received. The reward model is fit to the choice data, while the language model is updated by a variant of REINFORCE, with reinforcement signals provided by the reward model. Several features enable the efficiency gains: a small affirmative nudge added to each reinforcement signal, an epistemic neural network that models reward uncertainty, and information-directed exploration. With Gemma large language models (LLMs), our algorithm, using fewer than 20K labels, matches the performance of offline RLHF trained on 200K labels, a gain of more than 10x in data efficiency. Extrapolating from our results, we expect our algorithm trained on 1M labels to match offline RLHF trained on 1B labels, a 1,000x gain. To our knowledge, these are the first results to demonstrate that such large improvements are possible.
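The online loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a small discrete set of candidate responses in place of a language model, a Bradley-Terry choice model for both the simulated rater and the reward model, a softmax policy, and a reinforcement signal equal to the reward-model score difference plus a constant nudge. All names (`reward_model_step`, `reinforce_step`, `run`) and hyperparameters are hypothetical. The epistemic neural network and information-directed exploration are omitted; both responses are simply sampled from the current policy.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_model_step(scores, winner, loser, lr=0.5):
    """One gradient-ascent step on the Bradley-Terry log-likelihood
    log sigmoid(scores[winner] - scores[loser]) (assumed choice model)."""
    p_win = sigmoid(scores[winner] - scores[loser])
    scores[winner] += lr * (1.0 - p_win)
    scores[loser] -= lr * (1.0 - p_win)


def reinforce_step(logits, action, signal, lr=0.1):
    """REINFORCE update for a softmax policy:
    logits += lr * signal * grad log pi(action)."""
    probs = softmax(logits)
    for k in range(len(logits)):
        grad = (1.0 if k == action else 0.0) - probs[k]
        logits[k] += lr * signal * grad


def run(true_rewards, rounds=2000, nudge=0.05, seed=0):
    """Toy online loop: update reward model and policy after every choice."""
    rng = random.Random(seed)
    n = len(true_rewards)
    logits = [0.0] * n   # policy parameters
    scores = [0.0] * n   # reward-model parameters
    for _ in range(rounds):
        a, b = rng.choices(range(n), weights=softmax(logits), k=2)
        if a == b:
            continue  # identical responses carry no preference information
        # Simulated rater: picks a over b with Bradley-Terry probability.
        p_a = sigmoid(true_rewards[a] - true_rewards[b])
        winner, loser = (a, b) if rng.random() < p_a else (b, a)
        reward_model_step(scores, winner, loser)
        # Reinforcement signal comes from the reward model, not the raw label;
        # a small positive nudge is added to every signal (assumed form).
        for action, other in ((a, b), (b, a)):
            signal = scores[action] - scores[other] + nudge
            reinforce_step(logits, action, signal)
    return softmax(logits), scores
```

Calling `run([0.0, 1.0, 2.0])` returns the final policy probabilities and the learned reward scores for three hypothetical responses. The point of the sketch is the ordering of updates: each observed choice first refines the reward model, and the policy is then reinforced using the reward model's scores.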

Paper Structure (23 sections, 7 equations, 9 figures)

Figures (9)

  • Figure 1: Performance, measured as the win rate over a baseline policy, as a function of the amount of human feedback, measured as the number of choices observed. Efficient exploration shifts the scaling law.
  • Figure 2: Feedback in the form of a choice between two generated responses is used to improve the policy.
  • Figure 3: Given responses generated by competing and baseline policies, the human feedback simulator produces a preference probability.
  • Figure 4: Strong performance of online RLHF relies on a reward function (left) and an affirmative nudge (right).
  • Figure 5: A neural network reward model versus an epistemic neural network reward model.
  • ...and 4 more figures