Algorithmic warm starts for Hamiltonian Monte Carlo

Matthew S. Zhang; Jason M. Altschuler; Sinho Chewi

Abstract

Generating samples from a continuous probability density is a central algorithmic problem across statistics, engineering, and the sciences. For high-dimensional settings, Hamiltonian Monte Carlo (HMC) is the default algorithm across mainstream software packages. However, despite the extensive line of work on HMC and its widespread empirical success, it remains unclear how many iterations of HMC are required as a function of the dimension $d$. On one hand, a variety of results show that Metropolized HMC (MHMC) converges in $O(d^{1/4})$ iterations from a warm start close to stationarity. On the other hand, MHMC is significantly slower without a warm start, e.g., requiring $\Omega(d^{1/2})$ iterations even for simple target distributions such as isotropic Gaussians. Finding a warm start is therefore the computational bottleneck for HMC. We resolve this issue for the well-studied setting of sampling from a probability distribution satisfying strong log-concavity (or isoperimetry) and third-order derivative bounds. We prove that \emph{non-Metropolized} HMC generates a warm start in $\tilde{O}(d^{1/4})$ iterations, after which we can exploit the warm start using MHMC. Our final complexity of $\tilde{O}(d^{1/4})$ is the fastest algorithm for high-accuracy sampling under these assumptions, improving over the prior best of $\tilde{O}(d^{1/2})$. This closes the long line of work on the dimensional complexity of MHMC for such settings, and also provides a simple warm-start prescription for practical implementations.
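
The two-phase prescription described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the function names (`leapfrog`, `hmc_step`, `two_phase_hmc`), the full momentum refresh at every iteration, and the reuse of a single step size `h` in both phases are choices made here for brevity, and the iteration counts `n_warm` and `n_main` are left to the caller.

```python
import numpy as np

def leapfrog(x, p, grad_V, h, n_steps):
    """Integrate Hamiltonian dynamics with n_steps leapfrog steps of size h."""
    p = p - 0.5 * h * grad_V(x)          # initial half kick
    for i in range(n_steps):
        x = x + h * p                    # full drift
        # full kick between steps, half kick after the last drift
        p = p - (h if i < n_steps - 1 else 0.5 * h) * grad_V(x)
    return x, p

def hmc_step(x, V, grad_V, h, n_steps, rng, metropolize):
    """One HMC iteration; the accept/reject filter is optional."""
    p0 = rng.standard_normal(x.shape)    # assumption: full momentum refresh
    x_new, p_new = leapfrog(x, p0, grad_V, h, n_steps)
    if not metropolize:
        return x_new, True               # unadjusted HMC always moves
    log_acc = (V(x) + 0.5 * p0 @ p0) - (V(x_new) + 0.5 * p_new @ p_new)
    if np.log(rng.uniform()) < log_acc:
        return x_new, True
    return x, False

def two_phase_hmc(x0, V, grad_V, h, n_steps, n_warm, n_main, rng):
    """Phase 1: unadjusted HMC (warm start). Phase 2: Metropolized HMC."""
    x = x0
    for _ in range(n_warm):
        x, _ = hmc_step(x, V, grad_V, h, n_steps, rng, metropolize=False)
    accepted = 0
    for _ in range(n_main):
        x, ok = hmc_step(x, V, grad_V, h, n_steps, rng, metropolize=True)
        accepted += ok
    return x, accepted / n_main
```

For a standard Gaussian target one could call `two_phase_hmc(np.zeros(d), lambda x: 0.5 * x @ x, lambda x: x, d ** -0.25, round(d ** 0.25), n_warm, n_main, np.random.default_rng(0))`; the returned acceptance rate gives a rough check that the first phase left the chain in a region where the Metropolis filter accepts.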

Paper Structure (45 sections, 38 theorems, 254 equations, 4 figures)

Introduction
Contributions
Related work
Background
(Metropolized) Hamiltonian Monte Carlo
Divergences between probability measures
Functional inequalities
Main result: high-accuracy sampling in $d^{1/4}$ steps
Formal statement of result
Extension to other settings via the proximal sampler
Proof of high-accuracy sampling guarantees using the warm start
Proof of Theorem \ref{thm:main-slc}
Proof of Corollaries \ref{cor:prox-iso} and \ref{cor:prox-wc}
LSI case.
Poincaré case.
...and 30 more sections

Key Result

Theorem 1.1

Consider a target distribution $\pi \propto \exp(-V)$ on $\mathbb{R}^d$, where $V$ is strongly convex, smooth, and has Frobenius-Lipschitz Hessian. There is an algorithm that uses $O(d^{1/4} \log^2 1/\varepsilon)$ first-order queries to produce a sample from a distribution $\mu$ where $\chi^2(\mu \m

Figures (4)

Figure 1: The convergence of MHMC is heavily dependent on the step size $h$. Large step sizes $h \asymp d^{-1/4}$ classically lead to fast convergence from a warm start, but can get stuck in cold starts due to very low acceptance probability (left). Small step sizes $h \asymp d^{-1/2}$ fix that issue but lead to slow movement, requiring at least $1/h \asymp d^{1/2}$ steps to traverse the space, let alone mix (right). Illustrated for the simple target $\pi = \mathcal{N}(0, I)$ in dimension $d = 10^4$, with "cold start" initialization at the mode $x_0 = 0$. Reproducibility details: MHMC is repeatedly integrated for $T=1$ unit of continuous time via $1/h$ leapfrog steps of size $h$, as is standard. Similar qualitative phenomena are observed for other settings.
Figure 2: A key algorithmic insight is that unadjusted HMC rapidly escapes cold starts using large step sizes $h \asymp d^{-1/4}$. This algorithm quickly leads to iterates which, if used as an initialization for MHMC, would have high acceptance probability (right). This motivates our two-phase algorithmic proposal: escape the cold start via unadjusted HMC, then exploit the warm start using MHMC. Removing the Metropolis filter bypasses the issue of low acceptance probability at cold starts (left) without requiring the use of small step sizes which result in slow movement (middle).
Figure 3: Schematic diagram of the one-shot coupling in the proof of Theorem \ref{thm:regularity}. Top and bottom (in black): the two OHO processes $(\bar{x}_0,\bar{p}_0) \mapsto (\bar{X}_h^\mathsf{OHO},\bar{P}_h^{\mathsf{OHO}})$ and $(x_0,p_0) \mapsto (X_h^\mathsf{OHO},P_h^{\mathsf{OHO}})$ are coupled to use the same Gaussian noise increments $\bar{B}_1, \bar{B}_2$. Diagonal (in blue): different noise increments $B_1,B_2$ enable an auxiliary OHO process $(\bar{x}_0,\bar{p}_0) \mapsto (X_h^\mathsf{OHO},P_h^{\mathsf{OHO}})$ that starts from one process and ends at the other. To achieve this interpolation, the auxiliary noise increments $B_1,B_2$ are uniquely determined as a function of $x_0,p_0,\bar{x}_0,\bar{p}_0,\bar{B}_1,\bar{B}_2$. Orange: $B_1$ in the first "O" step is uniquely determined so that the resulting momentum $P_h^{{\operatorname{target}}}$ will enable the correct $x$-coordinate $X_h^\mathsf{OH}$ after the "H" step. $B_2$ is then uniquely determined so that the $p$-coordinate $P_h^\mathsf{OHO}$ will match after the final "O" step.
Figure 4: Schematic diagram for the proof of Theorem \ref{thm:harnack}. Top and bottom (in black): the two processes $(\bar{X}_{kh}, \bar{P}_{kh})$ and $(X_{kh},P_{kh})$ that we seek to compare. Each horizontal arrow is an OHO step. Diagonal (in blue): the auxiliary process $(X_{kh}^{\mathsf{aux}},P_{kh}^{\mathsf{aux}})$ we construct to apply the shifted composition rule. The auxiliary process interpolates between the original two processes at initialization (top left) and termination (bottom right). The auxiliary process updates by first shifting the momentum (grey, downwards arrow) and then performing an OHO step (black, right arrow).
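
The cold-start effect described in the caption of Figure 1 can be probed numerically with a short script. The sketch below is an illustration under assumptions, not the authors' experiment code: the helper names (`leapfrog_gaussian`, `cold_start_acceptance`), the random seed, and the number of trials are hypothetical choices. It follows the stated setup of target $\pi = \mathcal{N}(0, I)$, dimension $d = 10^4$, initialization $x_0 = 0$, and $1/h$ leapfrog steps per proposal, and it compares the Metropolis acceptance probability of a single proposal for $h = d^{-1/4}$ and $h = d^{-1/2}$.

```python
import numpy as np

def leapfrog_gaussian(x, p, h, n_steps):
    """Leapfrog steps for V(x) = |x|^2 / 2, whose gradient is x."""
    p = p - 0.5 * h * x                  # initial half kick
    for i in range(n_steps):
        x = x + h * p                    # full drift
        # full kick between steps, half kick after the last drift
        p = p - (h if i < n_steps - 1 else 0.5 * h) * x
    return x, p

def cold_start_acceptance(d, h, rng, n_trials=20):
    """Average Metropolis acceptance probability of one proposal from x0 = 0."""
    n_steps = int(round(1.0 / h))        # integrate for T = 1 unit of time
    probs = []
    for _ in range(n_trials):
        x0 = np.zeros(d)
        p0 = rng.standard_normal(d)
        x1, p1 = leapfrog_gaussian(x0, p0, h, n_steps)
        energy_change = 0.5 * (x1 @ x1 + p1 @ p1) - 0.5 * (x0 @ x0 + p0 @ p0)
        probs.append(min(1.0, float(np.exp(-energy_change))))
    return float(np.mean(probs))

d = 10_000
rng = np.random.default_rng(0)           # assumption: arbitrary seed
acc_large = cold_start_acceptance(d, d ** -0.25, rng)  # h = 0.1
acc_small = cold_start_acceptance(d, d ** -0.5, rng)   # h = 0.01
print(f"h = d^(-1/4): mean acceptance {acc_large:.2e}")
print(f"h = d^(-1/2): mean acceptance {acc_small:.2f}")
```

With the larger step size the energy error at the mode grows with the dimension, so the first acceptance probability should come out far smaller than the second, which is the behavior that motivates dropping the Metropolis filter during the warm-start phase.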

Theorems & Definitions (74)

Theorem 1.1: Informal statement of Theorem \ref{thm:main-slc}
Theorem 1.2: Informal statement of Theorem \ref{thm:main-warm}
Remark 1.3: Regularity
Remark 1.4: Condition number
Definition 2.1: Rényi divergence
Proposition 2.2: Properties of Rényi divergences
Theorem 2.3: Shifted composition rule
Definition 2.5
Lemma 3.2: Gaussian chaos bound
Remark 3.3
...and 64 more
