A Theoretical Comparison of No-U-Turn Sampler Variants: Necessary and Sufficient Convergence Conditions and Mixing Time Analysis under Gaussian Targets

Samuel Gruffaz; Kyurae Kim; Fares Guehtar; Hadrien Duval-decaix; Pacôme Trautmann

Abstract

The No-U-Turn Sampler (NUTS) is the computational workhorse of modern Bayesian software libraries, yet its qualitative and quantitative convergence guarantees were established only recently. A significant gap remains in the theoretical comparison of its two main variants: NUTS-mul and NUTS-BPS, which use multinomial sampling and biased progressive sampling, respectively, for index selection. In this paper, we address this gap in three contributions. First, we derive the first necessary conditions for geometric ergodicity for both variants. Second, we establish the first sufficient conditions for geometric ergodicity and ergodicity for NUTS-mul. Third, we obtain the first mixing time result for NUTS-BPS on a standard Gaussian distribution. Our results show that NUTS-mul and NUTS-BPS exhibit nearly identical qualitative behavior, with geometric ergodicity depending on the tail properties of the target distribution. However, they differ quantitatively in their convergence rates. More precisely, when initialized in the typical set of the canonical Gaussian measure, the mixing times of both NUTS-mul and NUTS-BPS scale as $O(d^{1/4})$ up to logarithmic factors, where $d$ denotes the dimension. Nevertheless, the associated constants are strictly smaller for NUTS-BPS.
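As a rough illustration of what "index selection" means here, the sketch below builds a short leapfrog orbit for a standard Gaussian target and computes multinomial selection probabilities proportional to $\exp(-H)$ at each orbit point. This is the usual textbook description of multinomial sampling in dynamic HMC and is assumed, not confirmed, to match the paper's kernel $\mathbf{q}^{\textup{MUL}}_h$. The fixed forward-only orbit, the function names, and the numerical values are all hypothetical; the NUTS doubling procedure, the U-turn criterion, and the BPS kernel are not modeled.

```python
import math


def leapfrog(q, p, h):
    """One leapfrog (velocity Verlet) step for U(q) = |q|^2 / 2, so grad U(q) = q."""
    p_half = [pi - 0.5 * h * qi for qi, pi in zip(q, p)]
    q_new = [qi + h * pi for qi, pi in zip(q, p_half)]
    p_new = [pi - 0.5 * h * qi for qi, pi in zip(q_new, p_half)]
    return q_new, p_new


def hamiltonian(q, p):
    """H(q, p) = |q|^2 / 2 + |p|^2 / 2 for the standard Gaussian target."""
    return 0.5 * (sum(x * x for x in q) + sum(x * x for x in p))


def multinomial_index_probs(q0, p0, h, n_steps):
    """Selection probabilities proportional to exp(-H) along a forward orbit.

    Assumption: the orbit is a fixed forward orbit of n_steps leapfrog steps,
    not an orbit produced by the NUTS doubling procedure.
    """
    orbit = [(q0, p0)]
    for _ in range(n_steps):
        orbit.append(leapfrog(orbit[-1][0], orbit[-1][1], h))
    energies = [hamiltonian(q, p) for q, p in orbit]
    e_min = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical starting point and step size, chosen only for illustration.
probs = multinomial_index_probs([1.0, 0.0, -1.0], [0.5, -0.5, 1.0], h=0.1, n_steps=8)
print(probs)
```

With a small step size the leapfrog energy error is small, so the probabilities are close to uniform over the orbit; they become exactly uniform in the idealized case where the integrator conserves $H$.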

Paper Structure (41 sections, 26 theorems, 168 equations, 7 figures)


Introduction
Notation
Preliminaries on Hamiltonian Monte Carlo and NUTS
Hamiltonian Monte Carlo
Selecting $T$.
NUTS
The NUTS orbit selection kernel $\mathbf{p}_h$.
Multinomial (mul) index selection kernel $\mathbf{q}^{\textup{MUL}}_h$.
Biased progressive sampling (BPS) index selection kernel $\mathbf{q}^{\textup{BPS}}_h$.
Intuitive understanding and comparison of $\mathbf{q}^{\textup{MUL}}_h,\mathbf{q}^{\textup{BPS}}_h$.
Quantitative analysis of mixing time
Mixing time for NUTS-BPS
Comparison of the optimal constants in the high-dimensional limit
Optimal constants.
Necessary conditions for geometric ergodicity
...and 26 more sections

Key Result

Proposition 1

Assume that $H(\Phiverlet[h][T](q_0,p_0))=H(q_0,p_0)$ for any $T\in \mathbb{Z},(q_0,p_0)\in (\mathbb{R}^d)^2$ and that there exists some $k^*\in \mathbb{N}_{>0}$ such that for any $q_0,p_0\in (\mathbb{R}^d)^2$. Denote respectively by $\mathrm{K}^{\textup{MUL},*}_h,\mathrm{K}^{\textup{BPS},*}_h$ the kernels $\mathrm{K}^{\textup{MUL}}_h,\mathrm{K}^{\textup{BPS}}_h$ in this ideal case. Then, for any

Figures (7)

Figure 1: Scheme of a dynamic HMC algorithm
Figure 2: Scheme of the construction of the index set sampled with $\mathbf{p}_h$, based on hoffman2014no
Figure 3: Construction of probabilities $\mathbf{q}^{\textup{BPS}}_h$ in the example of sampling $q_0,p_0$ and $\Phiverlet[h][3](q_0,p_0)$ with $\mathbf{q}^{\textup{BPS}}_h(\cdot|\mathsf{I}-3,\Phiverlet[h][3](q_0,p_0))$.
Figure 4: Lebesgue densities of the limit time distributions $\mathcal{L}_{\textup{time},T_0,\infty}^{\textup{MUL}}$ and $\mathcal{L}_{\textup{time},T_0,\infty}^{\textup{BPS}}$ under the assumptions of \ref{prop:simplify_index}.
Figure 5: For an index set $\mathsf{I}=[-1:6]$, the multinoulli distribution $\text{Multinoulli}((q_i)_{i\in\mathsf{I}})$ with normalized weights $\sum_{i\in \mathsf{I}} q_i=1$ is split into its maximal uniform part on $\mathsf{I}^{\text{last}}$, i.e., $|\mathsf{I}^{\text{last}}|\min_{j\in \mathsf{I}^{\text{last}}} q_j \,\mathcal{U}(\mathsf{I}^{\text{last}})$ (shown in blue), and the remaining part $(1-|\mathsf{I}^{\text{last}}|\min_{j\in \mathsf{I}^{\text{last}}} q_j)\, \text{Multinoulli}((a_i)_{i\in \mathsf{I}})$ with $a_i=q_i-\mathrm{1}_{\mathsf{I}^{\text{last}}}(i)\min_{j\in \mathsf{I}^{\text{last}}} q_j$. This principle is applied to $q_i=\mathbf{q}^{\textup{BPS}}_h(i|\mathsf{I},q,p)/\sum_{j\in \mathsf{I}} \mathbf{q}^{\textup{BPS}}_h(j|\mathsf{I},q,p)$ for any $\mathsf{I}\sim \mathbf{p}_h(\cdot|q,p)$ and $(q,p)\in \mathsf{D}_\alpha\times \mathsf{E}_{\alpha,r}$.
...and 2 more figures
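The decomposition described in the Figure 5 caption can be sketched numerically. The example below is a minimal illustration, not the paper's construction: the weights $q_i$ and the choice $\mathsf{I}^{\text{last}}=\{3,4,5,6\}$ are hypothetical values picked so the arithmetic is easy to follow, and the function name is invented for this sketch. Exact rational arithmetic is used so that the two parts add back to the original weights without rounding error.

```python
from fractions import Fraction


def split_uniform_part(q, last):
    """Split normalized weights q into a uniform part on `last` and a remainder.

    q    : dict mapping index -> probability, summing to 1
    last : set of indices playing the role of I^last (hypothetical choice)
    Returns (uniform_mass, a) where uniform_mass = |last| * min_{j in last} q_j
    and a[i] = q[i] - 1_{last}(i) * min_{j in last} q_j (unnormalized remainder).
    """
    m = min(q[j] for j in last)
    uniform_mass = len(last) * m
    a = {i: q[i] - (m if i in last else 0) for i in q}
    return uniform_mass, a


# Hypothetical weights on the index set I = [-1:6].
indices = list(range(-1, 7))
raw = [1, 2, 3, 4, 4, 5, 6, 7]
q = {i: Fraction(r, sum(raw)) for i, r in zip(indices, raw)}
last = {3, 4, 5, 6}  # assumed I^last for this illustration

uniform_mass, a = split_uniform_part(q, last)
print(uniform_mass)     # total mass carried by the uniform component
print(sum(a.values()))  # total mass carried by the remainder
```

The remainder weights $a_i$ are nonnegative and sum to $1-|\mathsf{I}^{\text{last}}|\min_{j\in \mathsf{I}^{\text{last}}} q_j$, so dividing them by that mass gives a proper multinoulli distribution for the second mixture component.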

Theorems & Definitions (47)

Definition 1: From durmus2023convergence
Proposition 1
Proof
Definition 2
Theorem 1
Theorem 2
Remark 1
Proposition 2
Remark 2
Theorem 3
...and 37 more
