When does Metropolized Hamiltonian Monte Carlo provably outperform Metropolis-adjusted Langevin algorithm?

Yuansi Chen; Khashayar Gatmiry; Minhui Jiang

TL;DR

This work analyzes Metropolized Hamiltonian Monte Carlo (HMC) with leapfrog integration for sampling from smooth densities on $\mathbb{R}^d$ that satisfy a Cheeger-type isoperimetric inequality and whose Hessian is Lipschitz in Frobenius norm. It establishes a gradient complexity of $\tilde{O}(d^{1/4}\,\text{polylog}(1/\varepsilon))$ from a warm start. The key technical step is to show that the joint distribution of the discretized location-velocity pair stays approximately invariant across leapfrog steps; an induction over the number of steps then yields sharp control of acceptance rates and transition overlaps. The main theorem gives a mixing-time bound $\tau_{mix}^\text{HMC}(\varepsilon) = O\left( \frac{1}{K^2\eta^2\psi_\mu^2} \log\left(\frac{M}{\varepsilon}\right) \right)$ and, since each iteration costs $K$ gradient evaluations, a gradient complexity of $O\left( \frac{1}{K\eta^2\psi_\mu^2} \log\left(\frac{M}{\varepsilon}\right) \right)$, for an $L$-smooth, $\gamma L^{3/2}$-strongly Hessian Lipschitz target with isoperimetric coefficient $\psi_\mu$ and an $M$-warm start. With the choices $K \asymp d^{1/4}$ and $\eta \asymp L^{-1}d^{-1/4}$ (for constant $\gamma$), the gradient complexity of HMC scales as $d^{1/4}$ in the dimension, improving on the $d^{1/2}$ scaling of the previous MALA analysis [WSC22] in the same regime. The paper also gives examples (ridge-separable functions and two-layer neural networks) that satisfy the assumptions and illustrate when $K>1$ brings a tangible benefit.
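As a quick consistency check on these rates, the substitution below plugs the stated parameter choices into the two bounds. It treats $L$, $\gamma$ and $\psi_\mu$ as dimension-free constants, which is an assumption made here only to isolate the dependence on $d$.

$$
\begin{aligned}
K \asymp d^{1/4},\quad \eta \asymp d^{-1/4}
  &\;\Longrightarrow\; K^2\eta^2 \asymp d^{1/2}\cdot d^{-1/2} = 1,
  \qquad K\eta^2 \asymp d^{1/4}\cdot d^{-1/2} = d^{-1/4}, \\
\text{iterations:}\quad
  \frac{1}{K^2\eta^2\psi_\mu^2}\log\frac{M}{\varepsilon}
  &\asymp \frac{1}{\psi_\mu^2}\log\frac{M}{\varepsilon}, \\
\text{gradient evaluations:}\quad
  \frac{1}{K\eta^2\psi_\mu^2}\log\frac{M}{\varepsilon}
  &\asymp \frac{d^{1/4}}{\psi_\mu^2}\log\frac{M}{\varepsilon}.
\end{aligned}
$$

Under this reading, the $d^{1/4}$ factor comes entirely from the $K$ gradient evaluations per iteration. Setting $K = 1$ with the same step size in the same formula gives $1/\eta^2 \asymp d^{1/2}$, which is consistent with the $d^{1/2}$ rate quoted for MALA in the abstract.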

Abstract

We analyze the mixing time of Metropolized Hamiltonian Monte Carlo (HMC) with the leapfrog integrator to sample from a distribution on $\mathbb{R}^d$ whose log-density is smooth, has Lipschitz Hessian in Frobenius norm and satisfies isoperimetry. We bound the gradient complexity to reach $ε$ error in total variation distance from a warm start by $\tilde O(d^{1/4}\text{polylog}(1/ε))$ and demonstrate the benefit of choosing the number of leapfrog steps to be larger than 1. To surpass the previous analysis on Metropolis-adjusted Langevin algorithm (MALA) that has $\tilde{O}(d^{1/2}\text{polylog}(1/ε))$ dimension dependency [WSC22], we reveal a key feature in our proof that the joint distribution of the location and velocity variables of the discretization of the continuous HMC dynamics stays approximately invariant. This key feature, when shown via induction over the number of leapfrog steps, enables us to obtain estimates on moments of various quantities that appear in the acceptance rate control of Metropolized HMC. Notably, our analysis does not require log-concavity or independence of the marginals, and only relies on an isoperimetric inequality. To illustrate the relevance of the Lipschitz Hessian in Frobenius norm assumption, several examples that fall into our framework are discussed.
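For readers who want the algorithm in concrete form, here is a minimal sketch of Metropolized HMC with a leapfrog integrator, assuming a standard Gaussian velocity and the Hamiltonian $f(x) + \|v\|^2/2$. The function names (`leapfrog`, `metropolized_hmc`) and the example target are illustrative choices, not taken from the paper, and the sketch omits the lazy-chain modification listed in the paper's preliminaries.

```python
import numpy as np


def leapfrog(x, v, grad_f, eta, K):
    """Run K leapfrog steps of size eta for H(x, v) = f(x) + |v|^2 / 2."""
    x, v = x.copy(), v.copy()
    v -= 0.5 * eta * grad_f(x)              # initial half step in velocity
    for k in range(K):
        x += eta * v                        # full step in location
        # full velocity step between location updates, half step at the end
        step = eta if k < K - 1 else 0.5 * eta
        v -= step * grad_f(x)
    return x, v


def metropolized_hmc(f, grad_f, x0, eta, K, n_iters, rng):
    """Sample from a density proportional to exp(-f) (illustrative sketch).

    Returns the chain of locations and the empirical acceptance rate.
    """
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_iters, x.size))
    n_accept = 0
    for t in range(n_iters):
        v = rng.standard_normal(x.size)     # refresh velocity
        x_new, v_new = leapfrog(x, v, grad_f, eta, K)
        h_old = f(x) + 0.5 * (v @ v)
        h_new = f(x_new) + 0.5 * (v_new @ v_new)
        # Metropolis filter: accept with probability min(1, exp(h_old - h_new))
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
            n_accept += 1
        samples[t] = x
    return samples, n_accept / n_iters


if __name__ == "__main__":
    # Hypothetical test target: standard Gaussian, f(x) = |x|^2 / 2.
    d = 16
    rng = np.random.default_rng(0)
    chain, acc = metropolized_hmc(
        f=lambda x: 0.5 * float(x @ x),
        grad_f=lambda x: x,
        x0=np.zeros(d),
        eta=d ** -0.25,                     # step size scaling as d^{-1/4}
        K=max(1, round(d ** 0.25)),         # number of leapfrog steps ~ d^{1/4}
        n_iters=1000,
        rng=rng,
    )
    print("acceptance rate:", acc)
```

With `K = 1` the proposal reduces to a MALA proposal with a reparameterized step size, so the same routine can be used to compare the two regimes. This version recomputes the gradient at the start of each trajectory; caching the last gradient of an accepted trajectory would bring the cost to `K` gradient evaluations per iteration.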

Paper Structure (37 sections, 25 theorems, 186 equations, 1 table)

Introduction
Related work
Our contribution
Preliminaries
Markov chain basics
$s$-conductance.
Lazy chain.
Total variation distance.
Warm start.
Mixing time.
HMC basics
Continuous HMC dynamics.
Metropolized HMC with leapfrog integrator.
Notation
Regularity properties of a target density
...and 22 more sections

Key Result

Theorem 1

Let $\mu\propto e^{-f}$ be a target density on $\mathbb{R}^d$ that satisfies the paper's main assumption. For any error tolerance $\epsilon \in (0, 1)$, from any $M$-warm initial measure $\mu_0$, if the HMC parameter choices are such that [...], where $\ell \geq 2\left\lceil c' \log\left( \max\left\{ 1,\frac{1}{K\eta\psi_\mu} \right \}\frac{M}{\epsilon} \right) \right \rceil$, $d_\ell = d + 2(\ell-1)$ [...]

Theorems & Definitions (43)

Theorem 1
Corollary 1: Best HMC
Corollary 2: MALA
Lemma 1: Lovász and Simonovits (1993)
Lemma 2: Proposal overlap
Lemma 3: Acceptance rate control
Lemma 4
Lemma 5
Lemma 6
Lemma 7
...and 33 more
