A Tutorial on Thompson Sampling

Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen

TL;DR

Thompson sampling reframes online decision making as Bayesian posterior sampling over unknown parameters and selects actions by maximizing expected reward under a freshly drawn plausible model. The tutorial unifies exact Bayesian updates (as in Beta-Bernoulli and log-Gaussian shortest-path settings) with practical approximations (Gibbs, Laplace, Langevin, bootstrap) and discusses prior design, nonstationarity, and concurrency. It presents a broad set of applications—from Bernoulli bandits to news recommender systems and reinforcement learning—illustrating the method's versatility and the conditions under which it excels or falters. The chapter also surveys regret analyses, information-theoretic bounds, and alternative exploration strategies, offering both practical guidance and theoretical grounding for applying TS to complex, structured online decision problems.

Abstract

Thompson sampling is an algorithm for online decision problems where actions are taken sequentially in a manner that must balance between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance. The algorithm addresses a broad range of problems in a computationally efficient manner and is therefore enjoying wide use. This tutorial covers the algorithm and its application, illustrating concepts through a range of examples, including Bernoulli bandit problems, shortest path problems, product recommendation, assortment, active learning with neural networks, and reinforcement learning in Markov decision processes. Most of these problems involve complex information structures, where information revealed by taking an action informs beliefs about other actions. We will also discuss when and why Thompson sampling is or is not effective and relations to alternative algorithms.
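As a rough illustration of the loop described above (sample a plausible model from the posterior, act greedily under it, then update beliefs), the following is a minimal sketch for the Beta-Bernoulli bandit case. It is not the authors' code: the function name `thompson_sampling_bernoulli`, the uniform Beta(1, 1) prior, the simulated rewards, and the arm probabilities in the usage line are all assumptions chosen for illustration.

```python
import random


def thompson_sampling_bernoulli(true_means, num_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling (illustrative sketch).

    true_means: hypothetical success probabilities, hidden from the agent
                and used only to simulate rewards.
    Returns the posterior parameters (alpha, beta) and per-arm pull counts.
    """
    rng = random.Random(seed)
    num_arms = len(true_means)
    # Assumed prior: Beta(1, 1), i.e. uniform, on each arm's mean reward.
    alpha = [1] * num_arms
    beta = [1] * num_arms
    pulls = [0] * num_arms

    for _ in range(num_rounds):
        # Draw one plausible mean reward per arm from the current posterior.
        sampled = [rng.betavariate(alpha[k], beta[k]) for k in range(num_arms)]
        # Act greedily with respect to the sampled model.
        arm = max(range(num_arms), key=lambda k: sampled[k])
        # Observe a simulated Bernoulli reward for the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: a success increments alpha, a failure beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1

    return alpha, beta, pulls


if __name__ == "__main__":
    # Illustrative three-armed bandit; the probabilities are assumptions.
    alpha, beta, pulls = thompson_sampling_bernoulli([0.9, 0.8, 0.7], 1000)
    print("pulls per arm:", pulls)
```

Because each action is chosen by maximizing under a random posterior draw rather than under the posterior mean, arms with uncertain estimates still get selected occasionally, which is how the sketch balances exploration against exploitation.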

Paper Structure

This paper contains 39 sections, 62 equations, 21 figures, 6 algorithms.

Figures (21)

  • Figure 1: Shortest path problem.
  • Figure 2: Online decision algorithm.
  • Figure 3: Probability density functions over mean rewards.
  • Figure 4: Probability that the greedy algorithm and Thompson sampling select an action.
  • Figure 5: Regret from applying greedy and Thompson sampling algorithms to the three-armed Bernoulli bandit.
  • ...and 16 more figures

Theorems & Definitions (14)

  • Example 1.1
  • Example 1.2
  • Example 3.1
  • Example 4.1
  • Example 4.2
  • Example 5.1
  • Example 7.1
  • Example 7.2
  • Example 8.1
  • Example 8.2
  • ...and 4 more