Bootstrapped Thompson Sampling and Deep Exploration

Ian Osband, Benjamin Van Roy

TL;DR

The paper tackles exploration in sequential decision tasks where deep, nonlinear models make posterior sampling impractical. It introduces a bootstrap-based surrogate for Thompson sampling that enriches data with artificial samples to induce priors, enabling efficient exploration in both multi-armed bandits and reinforcement learning. The authors formalize BootstrapThompson, analyze when artificial data ensures effective exploration, and extend the approach to deep RL via bootstrapped value function randomization with incremental, parallelizable variants. This yields a scalable framework for deep exploration that integrates bootstrap methods with Bayesian-inspired intuition in complex environments.

Abstract

This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The approach is based on a bootstrap technique that uses a combination of observed and artificially generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is critical to effective exploration. We explain how the approach can be applied to multi-armed bandit and reinforcement learning problems and how it relates to Thompson sampling. The approach is particularly well-suited for contexts in which exploration is coupled with deep learning, since in these settings, maintaining or generating samples from a posterior distribution becomes computationally infeasible.
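The idea in the abstract can be illustrated with a minimal sketch for a Bernoulli multi-armed bandit. This is not the authors' implementation: the function names `bootstrap_thompson_step` and `run_bandit`, the choice of one artificial reward of 0 and one of 1 per arm, and the use of a plain sample mean as the fitted model are all assumptions made for illustration. At each step, every arm's observed rewards are pooled with the artificial rewards, resampled with replacement, and the arm with the highest resampled mean is played.

```python
import random


def bootstrap_thompson_step(history, n_arms, rng, prior_rewards=(0.0, 1.0)):
    """Choose an arm from bootstrap resamples of observed plus artificial data.

    `prior_rewards` is a hypothetical choice of artificial data (one failure
    and one success per arm); it stands in for the prior-inducing samples.
    """
    estimates = []
    for arm in range(n_arms):
        # Pool the observed rewards with the artificial "prior" rewards.
        data = list(history[arm]) + list(prior_rewards)
        # Bootstrap: draw len(data) points with replacement.
        resample = [rng.choice(data) for _ in data]
        estimates.append(sum(resample) / len(resample))
    # Act greedily with respect to the resampled estimates (first arm wins ties).
    return max(range(n_arms), key=lambda a: estimates[a])


def run_bandit(true_means, n_steps, seed=0):
    """Simulate the sketch on a Bernoulli bandit and return per-arm rewards."""
    rng = random.Random(seed)
    history = [[] for _ in true_means]
    for _ in range(n_steps):
        arm = bootstrap_thompson_step(history, len(true_means), rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        history[arm].append(reward)
    return history
```

Without the artificial data, an arm whose first few observed rewards are all 0 would have a resampled mean of exactly 0 on every draw and might never be tried again. The artificial success keeps its estimate random, which is the role the abstract attributes to the induced prior.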

Paper Structure

This paper contains 6 sections, 2 equations, 1 figure, 6 algorithms.

Figures (1)

  • Figure 1: Cumulative regret of BootstrapThompson using different bootstrap methods (lower is better). Artificial prior data helps to drive efficient exploration.