A Conceptual Introduction to Hamiltonian Monte Carlo

Michael Betancourt

TL;DR

This paper reframes Hamiltonian Monte Carlo (HMC) as a geometry-driven Markov transition method that leverages phase-space dynamics to efficiently explore the typical set of high-dimensional distributions. By introducing momentum and Hamiltonian dynamics, it explains how to design transitions that coherently traverse parameter space, and it details practical tuning via kinetic energy choices, integration-time strategies, and symplectic integrators with Metropolis corrections. The work highlights diagnostics and robustness results, argues for adaptive methods like Euclidean and Riemannian metric choices, and discusses termination criteria such as No-U-Turn to achieve dynamic, efficient sampling. Collectively, the approach enables scalable, principled Bayesian computation and underpins modern tools like Stan, with broad implications for high-dimensional statistical modeling.

Abstract

Hamiltonian Monte Carlo has proven a remarkable empirical success, but only recently have we begun to develop a rigorous understanding of why it performs so well on difficult problems and how it is best applied in practice. Unfortunately, that understanding is confined within the mathematics of differential geometry, which has limited its dissemination, especially to the applied communities for which it is particularly important. In this review I provide a comprehensive conceptual account of these theoretical foundations, focusing on developing a principled intuition behind the method and its optimal implementations rather than any exhaustive rigor. Whether a practitioner or a statistician, the dedicated reader will acquire a solid grasp of how Hamiltonian Monte Carlo works, when it succeeds, and, perhaps most importantly, when it fails.

Paper Structure

This paper contains 44 sections, 66 equations, 42 figures.

Figures (42)

  • Figure 1: To understand how the distribution of volume behaves with increasing dimension we can consider a rectangular partitioning centered around a distinguished point, such as the mode. (a) In one dimension the relative weight of the center partition is $1/3$, (b) in two dimensions it is $1/9$, (c) and in three dimensions it is only $1/27$. Very quickly the volume in the center partition becomes negligible compared to the neighboring volume.
  • Figure 2: The dominance of volume away from any point in parameter space can also be seen from a spherical perspective, where we consider the volume contained within a radial distance $\delta$ both interior to and exterior to a $D$-dimensional spherical shell, shown here with dashed lines. (a) In one dimension the spherical shell is a line and the volumes interior and exterior are equivalent. (b) In two dimensions the spherical shell becomes a circle and there is more volume immediately outside the shell than immediately inside. (c) The exterior volume grows even larger relative to the interior volume in three dimensions, where the spherical shell is now the surface of a sphere. In fact, with increasing dimension the exterior volume grows exponentially large relative to the interior volume, and very quickly the volume around the mode is dwarfed by the volume away from the mode.
  • Figure 3: In high dimensions a probability density, $\pi \! \left( q \right)$, will concentrate around its mode, but the volume over which we integrate that density, $\mathrm{d}q$, is much larger away from the mode. Contributions to any expectation are determined by the product of density and volume, $\pi \! \left( q \right) \mathrm{d} q$, which then concentrates in a nearly-singular neighborhood called the typical set (grey).
  • Figure 4: In high-dimensional parameter spaces probability mass, $\pi \! \left( q \right) \mathrm{d} q$, and hence the dominant contributions to expectations, concentrates in a neighborhood called the typical set. In order to accurately estimate expectations we have to be able to identify where the typical set lies in parameter space so that we can focus our computational resources where they are most effective.
  • Figure 5: (a) A Markov chain is a sequence of points in parameter space generated by a Markov transition density (green) that defines the probability of a new point given the current point. (b) Sampling from that distribution yields a new state in the Markov chain and a new distribution from which to sample. (c) Repeating this process generates a Markov chain that meanders through parameter space.
  • ...and 37 more figures
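
The volume arguments in the captions for Figures 1 and 2 can be checked numerically. The sketch below is illustrative only and does not come from the paper: the function names, the unit radius, and the shell thickness `delta = 0.1` are assumptions chosen for the example. It computes the fraction of volume in the central cell when every axis is cut into thirds, and the ratio of the volume just outside a sphere to the volume just inside it (the constant factor in the volume of a $D$-dimensional ball cancels in that ratio).

```python
def center_fraction(dim: int) -> float:
    """Fraction of total volume in the central cell when each of the
    `dim` axes is partitioned into three equal pieces (Figure 1)."""
    return (1.0 / 3.0) ** dim


def shell_volume_ratio(dim: int, radius: float = 1.0, delta: float = 0.1) -> float:
    """Ratio of the volume within `delta` outside a sphere of the given
    radius to the volume within `delta` inside it (Figure 2).

    Ball volume scales as radius**dim, and the dimension-dependent
    constant cancels in the ratio, so only powers of the radii remain.
    `radius` and `delta` are illustrative choices, not values from the paper.
    """
    exterior = (radius + delta) ** dim - radius ** dim
    interior = radius ** dim - (radius - delta) ** dim
    return exterior / interior


if __name__ == "__main__":
    # Print both quantities for a few dimensions to see the trend.
    for dim in (1, 2, 3, 10, 100):
        print(dim, center_fraction(dim), shell_volume_ratio(dim))
```

Under these assumptions the central fraction reproduces the $1/3$, $1/9$, $1/27$ sequence from Figure 1, while the shell ratio equals one in one dimension and grows with every added dimension, consistent with the claim that volume away from the mode quickly dominates.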