Algorithmic warm starts for Hamiltonian Monte Carlo

Matthew S. Zhang, Jason M. Altschuler, Sinho Chewi

Abstract

Generating samples from a continuous probability density is a central algorithmic problem across statistics, engineering, and the sciences. For high-dimensional settings, Hamiltonian Monte Carlo (HMC) is the default algorithm across mainstream software packages. However, despite the extensive line of work on HMC and its widespread empirical success, it remains unclear how many iterations of HMC are required as a function of the dimension $d$. On one hand, a variety of results show that Metropolized HMC (MHMC) converges in $O(d^{1/4})$ iterations from a warm start close to stationarity. On the other hand, Metropolized HMC is significantly slower without a warm start, e.g., requiring $\Omega(d^{1/2})$ iterations even for simple target distributions such as isotropic Gaussians. Finding a warm start is therefore the computational bottleneck for HMC. We resolve this issue for the well-studied setting of sampling from a probability distribution satisfying strong log-concavity (or isoperimetry) and third-order derivative bounds. We prove that \emph{non-Metropolized} HMC generates a warm start in $\tilde{O}(d^{1/4})$ iterations, after which we can exploit the warm start using Metropolized HMC. Our final complexity of $\tilde{O}(d^{1/4})$ is the fastest algorithm for high-accuracy sampling under these assumptions, improving over the prior best of $\tilde{O}(d^{1/2})$. This closes the long line of work on the dimensional complexity of MHMC for such settings, and also provides a simple warm-start prescription for practical implementations.

Paper Structure (45 sections, 38 theorems, 254 equations, 4 figures)

Key Result

Theorem 1.1

Consider a target distribution $\pi \propto \exp(-V)$ on $\mathbb{R}^d$, where $V$ is strongly convex, smooth, and has Frobenius-Lipschitz Hessian. There is an algorithm that uses $O(d^{1/4} \log^2 1/\varepsilon)$ first-order queries to produce a sample from a distribution $\mu$ where $\chi^2(\mu \m

Figures (4)

  • Figure 1: The convergence of MHMC is heavily dependent on the step size $h$. Large step sizes $h \asymp d^{-1/4}$ classically lead to fast convergence from a warm start, but can get stuck in cold starts due to very low acceptance probability (left). Small step sizes $h \asymp d^{-1/2}$ fix that issue but lead to slow movement, requiring at least $1/h \asymp d^{1/2}$ steps to traverse the space, let alone mix (right). Illustrated for the simple target $\pi = \mathcal{N}(0, I)$ in dimension $d = 10^4$, with "cold start" initialization at the mode $x_0 = 0$. Reproducibility details: MHMC is repeatedly integrated for $T=1$ unit of continuous time via $1/h$ leapfrog steps of size $h$, as is standard. Similar qualitative phenomena are observed for other settings.
  • Figure 2: A key algorithmic insight is that unadjusted HMC rapidly escapes cold starts using large step sizes $h \asymp d^{-1/4}$. This algorithm quickly leads to iterates which, if used as an initialization for MHMC, would have high acceptance probability (right). This motivates our two-phase algorithmic proposal: escape the cold start via unadjusted HMC, then exploit the warm start using MHMC. Removing the Metropolis filter bypasses the issue of low acceptance probability at cold starts (left) without requiring the use of small step sizes which result in slow movement (middle).
  • Figure 3: Schematic diagram of the one-shot coupling in the proof of Theorem \ref{thm:regularity}. Top and bottom (in black): the two OHO processes $(\bar{x}_0,\bar{p}_0) \mapsto (\bar{X}_h^\mathsf{OHO},\bar{P}_h^{\mathsf{OHO}})$ and $(x_0,p_0) \mapsto (X_h^\mathsf{OHO},P_h^{\mathsf{OHO}})$ are coupled to use the same Gaussian noise increments $\bar{B}_1, \bar{B}_2$. Diagonal (in blue): different noise increments $B_1,B_2$ enable an auxiliary OHO process $(\bar{x}_0,\bar{p}_0) \mapsto (X_h^\mathsf{OHO},P_h^{\mathsf{OHO}})$ that starts from one process and ends at the other. To achieve this interpolation, the auxiliary noise increments $B_1,B_2$ are uniquely determined as a function of $x_0,p_0,\bar{x}_0,\bar{p}_0,\bar{B}_1,\bar{B}_2$. Orange: $B_1$ in the first "O" step is uniquely determined so that the resulting momentum $P_h^{{\operatorname{target}}}$ will enable the correct $x$-coordinate $X_h^\mathsf{OH}$ after the "H" step. $B_2$ is then uniquely determined so that the $p$-coordinate $P_h^\mathsf{OHO}$ will match after the final "O" step.
  • Figure 4: Schematic diagram for the proof of Theorem \ref{thm:harnack}. Top and bottom (in black): the two processes $(\bar{X}_{kh}, \bar{P}_{kh})$ and $(X_{kh},P_{kh})$ that we seek to compare. Each horizontal arrow is an OHO step. Diagonal (in blue): the auxiliary process $(X_{kh}^{\mathsf{aux}},P_{kh}^{\mathsf{aux}})$ we construct to apply the shifted composition rule. The auxiliary process interpolates between the original two processes at initialization (top left) and termination (bottom right). The auxiliary process updates by first shifting the momentum (grey, downwards arrow) and then performing an OHO step (black, right arrow).
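
The phenomena described in the captions of Figures 1 and 2 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The target $\pi = \mathcal{N}(0, I)$, the dimension $d = 10^4$, the cold start $x_0 = 0$, the step sizes $h = d^{-1/4}$ and $h = d^{-1/2}$, and the integration time $T = 1$ follow the Figure 1 caption; the function names (`leapfrog`, `hmc_step`, `demo`), the number of unadjusted warm-up iterations (`n_warmup=20`), the number of MHMC iterations, and the random seed are assumptions chosen purely for illustration.

```python
import numpy as np


def leapfrog(x, p, grad_V, h, n_steps):
    """Leapfrog integration of H(x, p) = V(x) + |p|^2 / 2 with n_steps steps of size h."""
    x, p = x.copy(), p.copy()
    p -= 0.5 * h * grad_V(x)              # initial half kick
    for i in range(n_steps):
        x += h * p                        # full drift
        g = grad_V(x)
        # full kick between drifts, half kick after the last drift
        p -= (h if i < n_steps - 1 else 0.5 * h) * g
    return x, p


def hmc_step(x, V, grad_V, h, n_steps, rng, metropolize):
    """One HMC iteration with fresh Gaussian momentum.

    Returns the next state and the Metropolis acceptance probability of the
    proposal. With metropolize=False the proposal is always taken
    (unadjusted HMC); the probability is still reported for diagnostics.
    """
    p = rng.standard_normal(x.shape)
    x_new, p_new = leapfrog(x, p, grad_V, h, n_steps)
    dH = (V(x_new) + 0.5 * p_new @ p_new) - (V(x) + 0.5 * p @ p)
    accept_prob = float(np.exp(min(0.0, -dH)))   # min(1, exp(-dH)) without overflow
    if metropolize and rng.random() >= accept_prob:
        return x, accept_prob                    # rejected: stay in place
    return x_new, accept_prob


def demo(d=10_000, seed=0, n_warmup=20, n_mhmc=20, T=1.0):
    """Compare cold-start acceptance with acceptance after an unadjusted warm-up.

    n_warmup and n_mhmc are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    V = lambda x: 0.5 * x @ x             # standard Gaussian potential
    grad_V = lambda x: x
    x0 = np.zeros(d)                      # cold start at the mode
    h_large, h_small = d ** -0.25, d ** -0.5

    def n_steps(h):
        return int(round(T / h))          # T units of continuous time

    # Figure 1 setting: MHMC proposals attempted directly from the cold start.
    _, cold_large = hmc_step(x0, V, grad_V, h_large, n_steps(h_large), rng, True)
    _, cold_small = hmc_step(x0, V, grad_V, h_small, n_steps(h_small), rng, True)

    # Figure 2 setting: unadjusted HMC with the large step, then MHMC.
    x = x0
    for _ in range(n_warmup):
        x, _ = hmc_step(x, V, grad_V, h_large, n_steps(h_large), rng, False)
    accepts = []
    for _ in range(n_mhmc):
        x, a = hmc_step(x, V, grad_V, h_large, n_steps(h_large), rng, True)
        accepts.append(a)
    return cold_large, cold_small, float(np.mean(accepts))


if __name__ == "__main__":
    cold_large, cold_small, warm = demo()
    print(f"cold start, h = d^(-1/4): acceptance probability {cold_large:.2e}")
    print(f"cold start, h = d^(-1/2): acceptance probability {cold_small:.2f}")
    print(f"after unadjusted warm-up, h = d^(-1/4): mean acceptance {warm:.2f}")
```

In this sketch the large step size is nearly always rejected from the cold start, the small step size is accepted but needs $1/h \asymp d^{1/2}$ leapfrog steps per unit of time, and the large step size is accepted with high probability once unadjusted HMC has moved the iterate away from the mode. The sketch only demonstrates the qualitative behavior on an isotropic Gaussian; it does not reproduce the paper's analysis or its iteration counts.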

Theorems & Definitions (74)

  • Theorem 1.1: Informal statement of Theorem \ref{thm:main-slc}
  • Theorem 1.2: Informal statement of Theorem \ref{thm:main-warm}
  • Remark 1.3: Regularity
  • Remark 1.4: Condition number
  • Definition 2.1: Rényi divergence
  • Proposition 2.2: Properties of Rényi divergences
  • Theorem 2.3: Shifted composition rule
  • Definition 2.5
  • Lemma 3.2: Gaussian chaos bound
  • Remark 3.3
  • ...and 64 more