Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

Carlos Riquelme, George Tucker, Jasper Snoek

TL;DR

The paper benchmarks a broad set of approximate Bayesian neural network posteriors within Thompson Sampling for contextual bandits, revealing that methods successful in supervised settings often falter online due to slow or misaligned uncertainty estimates. A key finding is that decoupling representation learning from uncertainty estimation (e.g., Neural Linear) yields robust, easy-to-tune performance, while more integrated approaches (VI, EP, BBB) can suffer from partial optimization in online settings. The Wheel Bandit and real-world datasets show that explicit, scalable uncertainty handling and fast adaptation are crucial for effective exploration. Overall, the work provides practical guidance and an open benchmark for evaluating Bayesian deep learning methods in online decision-making contexts.

Abstract

Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. However, developing practical methods to balance exploration and exploitation in complex domains remains largely unsolved. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to posterior samples of the model. At the same time, advances in approximate Bayesian methods have made posterior approximation for flexible neural network models practical. Thus, it is attractive to consider approximate Bayesian neural networks in a Thompson Sampling framework. To understand the impact of using an approximate posterior on Thompson Sampling, we benchmark well-established and recently developed methods for approximate posterior sampling combined with Thompson Sampling over a series of contextual bandit problems. We found that many approaches that have been successful in the supervised learning setting underperformed in the sequential decision-making scenario. In particular, we highlight the challenge of adapting slowly converging uncertainty estimates to the online setting.
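The observation that Thompson Sampling only needs posterior samples can be made concrete with a minimal sketch. It uses an exact Bayesian linear model per action, so it illustrates the general loop rather than any of the neural methods benchmarked in the paper. The function name `linear_thompson_sampling`, the known noise level `noise_std`, and the isotropic Gaussian prior with variance `prior_var` are assumptions made for this example and do not come from the paper.

```python
import numpy as np


def linear_thompson_sampling(contexts, true_weights, noise_std=0.1,
                             prior_var=1.0, seed=0):
    """Thompson Sampling with an exact Bayesian linear model per action.

    contexts: array (T, d). true_weights: array (K, d), used only to
    simulate rewards and measure regret. Returns cumulative regret (T,).
    Assumes a known noise level and an isotropic Gaussian prior
    (illustrative choices, not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    n_steps, dim = contexts.shape
    n_actions = true_weights.shape[0]

    # Gaussian posterior per action, stored as a precision matrix and a
    # precision-weighted mean b, so that mean = solve(precision, b).
    precision = np.stack([np.eye(dim) / prior_var for _ in range(n_actions)])
    b = np.zeros((n_actions, dim))

    regret = np.zeros(n_steps)
    for t, x in enumerate(contexts):
        # 1. Draw one weight vector per action from its posterior.
        sampled_values = np.empty(n_actions)
        for a in range(n_actions):
            mean = np.linalg.solve(precision[a], b[a])
            chol = np.linalg.cholesky(precision[a])  # precision = L @ L.T
            # solve(L.T, z) has covariance inv(L @ L.T) = inv(precision).
            w = mean + np.linalg.solve(chol.T, rng.standard_normal(dim))
            sampled_values[a] = w @ x

        # 2. Act greedily with respect to the sampled model.
        action = int(np.argmax(sampled_values))

        # 3. Observe a noisy reward for the chosen action only.
        expected = true_weights @ x
        reward = expected[action] + noise_std * rng.standard_normal()
        regret[t] = expected.max() - expected[action]

        # 4. Conjugate Gaussian update of the chosen action's posterior.
        precision[action] += np.outer(x, x) / noise_std**2
        b[action] += reward * x / noise_std**2

    return np.cumsum(regret)
```

Calling `linear_thompson_sampling(contexts, true_weights)` on synthetic data returns the cumulative regret curve. Replacing the sampling in step 1 with draws from an approximate posterior leaves the rest of the loop unchanged, which is the setting the benchmark studies.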

Paper Structure

This paper contains 12 sections, 3 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: Visualizations of the posterior approximations in a linear example.
  • Figure 2: The impact on regret of different approximated posteriors. We show (green) the actual linear posterior, (orange) the diagonal posterior approximation and (blue) the precision approximation in \ref{fig:figure4}. In \ref{fig:figure5} and \ref{fig:figure6} we visualize the impact of the approximations on cumulative regret.
  • Figure 3: Wheel bandits for increasing values of $\delta \in (0, 1)$. The optimal actions for the blue, red, green, black, and yellow regions are actions 1, 2, 3, 4, and 5, respectively.
  • Figure 4: Cumulative regret for Bayes By Backprop (Variational Inference, fixed noise $\sigma = 0.75$) applied to a linear model and an exact mean field solution, denoted PrecisionDiag, with a linear bandit (left) and with the Statlog bandit (right). The suffix of the BBB legend label indicates the number of training epochs in each training step. We emphasize that in this evaluation, all algorithms use the same family of models (i.e., linear). While PrecisionDiag exactly solves the mean field problem, BBB relies on partial optimization via SGD. As the number of training epochs increases, BBB improves performance, but is always outperformed by PrecisionDiag.
  • Figure 5: A boxplot of the ranks achieved by each algorithm across the suite of benchmarks. The red and black solid lines respectively indicate the median and mean rank across problems.
  • ...and 1 more figure
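The captions for Figures 2 and 4 contrast a diagonal posterior approximation with a precision-based one (PrecisionDiag). A minimal sketch of the difference follows, assuming that the first keeps only the diagonal of the covariance matrix and the second keeps only the diagonal of the precision matrix before inverting back. The helper names and the 2-D example covariance are hypothetical and chosen for illustration.

```python
import numpy as np


def diag_approximation(cov):
    """Keep the marginal variances: zero the off-diagonal of the covariance."""
    return np.diag(np.diag(cov))


def precision_diag_approximation(cov):
    """Zero the off-diagonal of the precision matrix, then invert back."""
    precision = np.linalg.inv(cov)
    return np.diag(1.0 / np.diag(precision))


# Hypothetical 2-D Gaussian posterior with strongly correlated weights.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])

print(diag_approximation(cov))            # variances 1.0 and 1.0
print(precision_diag_approximation(cov))  # variances of about 0.19 each
```

With correlated weights the precision-based approximation is much narrower than the marginal one, because each of its variances is the conditional variance of a weight given the others (here 1 - 0.9^2 = 0.19). A narrower posterior yields less diverse Thompson samples, which is one way such an approximation can change exploration behaviour.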