Efficient Exploration for LLMs

Vikranth Dwaracherla; Seyed Mohammad Asghari; Botao Hao; Benjamin Van Roy

TL;DR

The paper presents evidence that active exploration substantially improves data efficiency in RLHF for large language models, using double Thompson sampling (double TS) with epistemic uncertainty to guide query selection. It introduces an experimentation pipeline with a human-feedback simulator and compares passive, Boltzmann, infomax, and double TS exploration, showing that double TS achieves the best performance with far fewer queries. Key findings include the critical role of uncertainty estimation, the superior long-horizon performance of double TS, and benefits that grow with the volume of feedback. The authors suggest that using feedback this efficiently could accelerate progress toward superhuman creativity by decades. The work also discusses practical limitations and future directions such as stronger epistemic neural network (ENN) architectures and exploration in multiturn dialog.

Abstract

We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.
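The query-selection step described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: a small list of reward functions stands in for samples drawn from an epistemic neural network, and the function name `double_thompson_sample`, the `max_tries` cutoff, and the uniform-random fallback are hypothetical choices made for illustration only.

```python
import random
from typing import Callable, Sequence, Tuple

# A sampled reward function maps a candidate response to a scalar score.
RewardFn = Callable[[str], float]


def double_thompson_sample(
    responses: Sequence[str],
    reward_samples: Sequence[RewardFn],
    rng: random.Random,
    max_tries: int = 30,
) -> Tuple[int, int]:
    """Pick indices of two distinct responses to present for feedback.

    Assumption: `reward_samples` approximates draws from the reward
    model's epistemic uncertainty (for example, ensemble members).
    """
    if len(responses) < 2:
        raise ValueError("need at least two candidate responses")

    def best_under(reward: RewardFn) -> int:
        # Index of the response that the sampled reward function prefers.
        return max(range(len(responses)), key=lambda i: reward(responses[i]))

    # First response: the best one under one sampled reward function.
    first = best_under(rng.choice(reward_samples))

    # Second response: resample until a different winner appears.
    for _ in range(max_tries):
        second = best_under(rng.choice(reward_samples))
        if second != first:
            return first, second

    # Hypothetical fallback: all samples agreed, so pick another at random.
    others = [i for i in range(len(responses)) if i != first]
    return first, rng.choice(others)


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = ["a", "bb", "ccc"]
    # Two toy reward samples that disagree about which response is best.
    samples = [len, lambda s: -len(s)]
    print(double_thompson_sample(candidates, samples, rng))
```

When the sampled reward functions disagree, the pair covers responses that each look best under some plausible reward model, which is where feedback is most informative. When they all agree, the sketch falls back to a random second response so that a pair can still be formed.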

Paper Structure (21 sections, 4 equations, 11 figures, 7 algorithms)

Introduction
Experimentation Pipeline
Reward Model Architectures and Training
Point Estimate
Epistemic Neural Network
Training
Exploration Algorithms
Passive Exploration
Active Exploration with a Point Estimate
Active Exploration with an ENN
Empirical Results
Assessment of Exploration Algorithms
Scaling with the Volume of Feedback
Quality of Uncertainty Estimates
The Life of a Prompt
...and 6 more sections

Figures (11)

Figure 1: Queries required by double TS versus alternatives to attain various levels of performance.
Figure 2: The sequential querying and learning pipeline.
Figure 3: The performance assessment pipeline.
Figure 4: Our reward models take as input the last-layer embedding of the Gemini Nano language model. A stop gradient prevents updating of torso weights.
Figure 5: Performance with passive, Boltzmann, infomax, and double TS exploration algorithms. Active exploration leads to much better levels of performance with the same amount of data. The double TS exploration scheme leads to the best level of performance.
...and 6 more figures
