Scalable Monte Carlo for Bayesian Learning

Paul Fearnhead; Christopher Nemeth; Chris J. Oates; Chris Sherlock

TL;DR

The book develops scalable Monte Carlo methods for Bayesian learning, bringing together traditional MCMC with stochastic-gradient, non-reversible, and continuous-time approaches. It shows how gradient information, data subsampling, and piecewise deterministic Markov processes (PDMPs) can scale sampling to large datasets and high-dimensional parameter spaces, while addressing bias, variance, and convergence. Key contributions include a detailed treatment of stochastic-gradient MCMC (SGLD, SGHMC, SGRLD), kernel-based tools for analysis, and practical guidance on tuning and on assessing convergence with metrics such as kernel Stein discrepancies. The algorithms presented, ranging from Metropolis-Hastings-based schemes to PDMP-driven continuous-time samplers, offer scalable and flexible options for Bayesian inference in data-rich settings, with empirical demonstrations on logistic regression, Bayesian neural networks, and time-series models. Overall, the text connects theory to practice through both analytic results and real-data experiments.

Abstract

This book aims to provide a graduate-level introduction to advanced topics in Markov chain Monte Carlo (MCMC) algorithms, as applied broadly in the Bayesian computational context. Most, if not all, of these topics (stochastic gradient MCMC, non-reversible MCMC, continuous-time MCMC, and new techniques for convergence assessment) have emerged as recently as the last decade, and have driven substantial recent practical and theoretical advances in the field. A particular focus is on methods that are scalable with respect to either the amount of data or the data dimension, motivated by the emerging high-priority application areas in machine learning and AI.

Paper Structure (126 sections, 13 theorems, 439 equations, 43 figures, 1 table, 11 algorithms)

This paper contains 126 sections, 13 theorems, 439 equations, 43 figures, 1 table, 11 algorithms.

Background
Monte Carlo Methods
What is Monte Carlo Integration?
Importance Sampling
Monte Carlo or Quadrature?
Control Variates
Monte Carlo Integration and Bayesian Statistics
Example Applications
Logistic Regression
Bayesian Matrix Factorisation
Bayesian Neural Networks for Classification
Markov Chains
Reversible Markov chains
Convergence, Averages, and Variances
Ergodic Averages
...and 111 more sections

Key Result

Lemma 3.1

Assume that condition (eq:ass-lipschitz-j) holds. Then there exist constants $C_1,C_2>0$ such that the pseudo variances of the simple gradient estimator (eq:U-hat-simple) and the control variate-based gradient estimator (eq:cv-estimator) satisfy the following bounds:
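The two estimators compared in Lemma 3.1 can be illustrated numerically. The sketch below is not taken from the book: it assumes the forms commonly used in the stochastic-gradient MCMC literature, namely a simple estimator $(N/n)\sum_{i\in S}\nabla U_i(\theta)$ and a control-variate estimator $\sum_{i}\nabla U_i(\hat\theta) + (N/n)\sum_{i\in S}[\nabla U_i(\theta)-\nabla U_i(\hat\theta)]$. The 1-D logistic-regression model, the centring value `theta_hat`, and all function names are hypothetical choices made for illustration, and the mean squared error around the exact gradient is used only as a rough empirical stand-in for the variability the lemma bounds.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_i(theta, x, y):
    # Gradient of the negative log-likelihood of one observation (x, y)
    # in a 1-D logistic regression with y in {0, 1}.
    return (sigmoid(theta * x) - y) * x


def simple_estimator(theta, data, batch):
    # Assumed form: rescaled sum of per-observation gradients over the minibatch.
    N, n = len(data), len(batch)
    return N / n * sum(grad_i(theta, *data[i]) for i in batch)


def cv_estimator(theta, theta_hat, full_grad_hat, data, batch):
    # Assumed form: exact gradient at theta_hat plus a rescaled minibatch
    # estimate of the difference between gradients at theta and theta_hat.
    N, n = len(data), len(batch)
    diff = sum(grad_i(theta, *data[i]) - grad_i(theta_hat, *data[i]) for i in batch)
    return full_grad_hat + N / n * diff


def compare(N=2000, n=20, reps=500, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(N):
        x = rng.gauss(0.0, 1.0)
        y = 1 if rng.random() < sigmoid(1.0 * x) else 0
        data.append((x, y))

    theta_hat = 1.0   # hypothetical centring value, assumed close to the mode
    theta = 1.05      # point at which the gradient is estimated
    full_grad_hat = sum(grad_i(theta_hat, x, y) for x, y in data)
    exact = sum(grad_i(theta, x, y) for x, y in data)

    simple, cv = [], []
    for _ in range(reps):
        batch = rng.sample(range(N), n)  # minibatch drawn without replacement
        simple.append(simple_estimator(theta, data, batch))
        cv.append(cv_estimator(theta, theta_hat, full_grad_hat, data, batch))

    def mse(values):
        return sum((v - exact) ** 2 for v in values) / len(values)

    return mse(simple), mse(cv)


if __name__ == "__main__":
    mse_simple, mse_cv = compare()
    print("simple estimator MSE:", mse_simple)
    print("control-variate estimator MSE:", mse_cv)
```

When `theta` is close to `theta_hat`, the per-observation differences in the control-variate estimator are small, so its spread around the exact gradient should be much smaller than that of the simple estimator; the gap narrows as `theta` moves away from `theta_hat`.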

Figures (43)

Figure 1: Example of trapezoid rule. We can estimate the integral, by (i) setting $x_1,\dots,x_n$ to be evenly spaced points on $[0,1]$; (ii) creating $n-1$ trapezoids based on joining up the points $(x_k,h(x_k))$ (shaded in regions); and (iii) estimating the integral by the sum of the areas of the trapezoids.
Figure 2: Example of control variates for estimating $\mathbb{E} \left[{\sin(X)}\right]$, where $X$ has a standard normal distribution $\mathsf{N}(0,1)$. Each plot shows the function whose expectation is being estimated and 50 values used in the Monte Carlo estimate (dots). From left to right the functions are respectively: $h(x)=\sin(x)$, $h(x)=\sin(x)-x$, and $h(x)=\sin(x) - \pi x/2 + (x^2-1)/2$. The expectation of each function is constructed to be the same. The effect of introducing control variates in the middle and right-hand plot is to flatten out the function we are integrating -- in the middle plot, this happens for $x\approx0$ and for the right-hand plot for $x\approx \pi/2$. The variability of the function values, i.e. the dots, is smallest for the middle plot and largest for the right-hand plot.
Figure 3: 9-sided polygon where the Markov chain only moves clockwise (left figure), as in Example \ref{example.ngon.nrev}, or moves in either a clockwise or anti-clockwise direction with probability $1/3$ (right figure), as in Example \ref{example.ngon.rev.np}.
Figure 4: 9-sided polygon with Markov transitions described in Example \ref{ex.ngon.periodic}.
Figure 5: Three realisations of the Ornstein--Uhlenbeck process, all with $\sigma=1$, and on the time interval $[0,10]$. Other parameter settings are $x_0=2$, $m=4$ and $b=3$; $x_0=m=0$ and $b=1$; $x_0=-2$, $m=-4$ and $b=1/3$.
...and 38 more figures
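
The control-variate comparison described in the Figure 2 caption can be checked numerically. The sketch below is an illustration only, not code from the book: it draws standard normal samples and evaluates the three functions given in the caption, returning the sample mean and sample variance of each. The function name `control_variate_demo` and the dictionary keys are hypothetical.

```python
import math
import random


def control_variate_demo(n=50, seed=0):
    """Sample mean and variance of the three integrands in the Figure 2 caption."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # X ~ N(0, 1)

    # All three functions have the same expectation as sin(X), because
    # E[X] = 0 and E[X^2 - 1] = 0 under the standard normal distribution.
    funcs = {
        "plain": lambda x: math.sin(x),
        "flat_near_0": lambda x: math.sin(x) - x,
        "flat_near_pi_over_2": lambda x: math.sin(x) - math.pi * x / 2 + (x * x - 1) / 2,
    }

    results = {}
    for name, h in funcs.items():
        values = [h(x) for x in xs]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased sample variance
        results[name] = (mean, var)
    return results


if __name__ == "__main__":
    for name, (mean, var) in control_variate_demo(n=20000, seed=1).items():
        print(f"{name}: mean={mean:.4f}, variance={var:.4f}")
```

With a large sample, the variance ordering should match the caption: smallest for the function flattened near $x\approx 0$ and largest for the one flattened near $x\approx\pi/2$, which shows that a control variate only helps when it flattens the integrand where the sampling distribution puts most of its mass.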

Theorems & Definitions (43)

Definition 1.1
Example 1.2
Example 1.3
Definition 1.4
Example 1.5
Example 1.6
Example 1.7
Example 1.8: Example \ref{example.Ltwo.running} continued
Example 1.9
Example 1.10
...and 33 more
