Efficient Exploration for LLMs
Vikranth Dwaracherla, Seyed Mohammad Asghari, Botao Hao, Benjamin Van Roy
TL;DR
The paper shows that active exploration substantially improves the data efficiency of reinforcement learning from human feedback (RLHF) for large language models. It introduces an experimentation pipeline built around a human-feedback simulator and compares passive, Boltzmann, infomax, and double Thompson sampling (TS) exploration, with epistemic uncertainty supplied by an epistemic neural network (ENN). Double TS performs best, reaching high performance with far fewer queries. Key findings are that uncertainty estimation is critical, that double TS performs best over long horizons, and that the benefit grows with scale, suggesting that efficient use of feedback could accelerate progress toward superhuman creativity, potentially by decades. The work also discusses practical limitations and future directions such as stronger ENN architectures and exploration in multi-turn dialog.
Abstract
We present evidence of substantial benefit from efficient exploration in gathering human feedback to improve large language models. In our experiments, an agent sequentially generates queries while fitting a reward model to the feedback received. Our best-performing agent generates queries using double Thompson sampling, with uncertainty represented by an epistemic neural network. Our results demonstrate that efficient exploration enables high levels of performance with far fewer queries. Further, both uncertainty estimation and the choice of exploration scheme play critical roles.
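To make the query-selection step concrete, the following is a minimal sketch of double Thompson sampling over a fixed set of candidate responses. It is an illustration under stated assumptions, not the authors' implementation: the function name `double_thompson_sample`, the `max_tries` parameter, and the use of a plain matrix of reward samples (one row per sampled reward model, standing in for draws from an epistemic neural network) are all hypothetical. The fallback of picking the second response uniformly at random when no distinct response is found is likewise an assumption of this sketch.

```python
import numpy as np


def double_thompson_sample(reward_samples, rng, max_tries=30):
    """Pick a pair of distinct candidate responses to send for feedback.

    reward_samples: array of shape (num_models, num_candidates). Row k holds
        the rewards that the k-th sampled reward model assigns to each
        candidate. This stands in for draws from an epistemic neural network
        (an assumption of this sketch).
    rng: a numpy.random.Generator.
    max_tries: hypothetical cap on resampling attempts for the second response.
    """
    reward_samples = np.asarray(reward_samples, dtype=float)
    num_models, num_candidates = reward_samples.shape
    if num_candidates < 2:
        raise ValueError("need at least two candidate responses")

    # First response: best candidate under one randomly sampled reward model.
    first = int(np.argmax(reward_samples[rng.integers(num_models)]))

    # Second response: resample a reward model until its best candidate
    # differs from the first, up to max_tries attempts.
    for _ in range(max_tries):
        second = int(np.argmax(reward_samples[rng.integers(num_models)]))
        if second != first:
            return first, second

    # Fallback (assumed): choose uniformly among the remaining candidates.
    others = [i for i in range(num_candidates) if i != first]
    return first, int(rng.choice(others))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 10 sampled reward models scoring 5 candidate responses (synthetic data).
    samples = rng.normal(size=(10, 5))
    print(double_thompson_sample(samples, rng))
```

When the sampled reward models disagree about which candidate is best, the pair tends to contain two plausible winners, so the resulting preference label is informative. When they all agree, the loop cannot find a distinct maximizer and the sketch falls back to a random second response.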
