A Theoretical Comparison of No-U-Turn Sampler Variants: Necessary and Sufficient Convergence Conditions and Mixing Time Analysis under Gaussian Targets

Samuel Gruffaz, Kyurae Kim, Fares Guehtar, Hadrien Duval-decaix, Pacôme Trautmann

Abstract

The No-U-Turn Sampler (NUTS) is the computational workhorse of modern Bayesian software libraries, yet its qualitative and quantitative convergence guarantees were established only recently. A significant gap remains in the theoretical comparison of its two main variants: NUTS-mul and NUTS-BPS, which use multinomial sampling and biased progressive sampling, respectively, for index selection. In this paper, we address this gap in three contributions. First, we derive the first necessary conditions for geometric ergodicity for both variants. Second, we establish the first sufficient conditions for geometric ergodicity and ergodicity for NUTS-mul. Third, we obtain the first mixing time result for NUTS-BPS on a standard Gaussian distribution. Our results show that NUTS-mul and NUTS-BPS exhibit nearly identical qualitative behavior, with geometric ergodicity depending on the tail properties of the target distribution. However, they differ quantitatively in their convergence rates. More precisely, when initialized in the typical set of the canonical Gaussian measure, the mixing times of both NUTS-mul and NUTS-BPS scale as $O(d^{1/4})$ up to logarithmic factors, where $d$ denotes the dimension. Nevertheless, the associated constants are strictly smaller for NUTS-BPS.

Paper Structure (41 sections, 26 theorems, 168 equations, 7 figures)

Key Result

Proposition 1

Assume that $H(\Phiverlet[h][T](q_0,p_0))=H(q_0,p_0)$ for any $T\in \mathbb{Z}$ and $(q_0,p_0)\in (\mathbb{R}^d)^2$, and that there exists some $k^*\in \mathbb{N}_{>0}$ such that for any $(q_0,p_0)\in (\mathbb{R}^d)^2$. Denote by $\mathrm{K}^{\textup{MUL},*}_h$ and $\mathrm{K}^{\textup{BPS},*}_h$, respectively, the kernels $\mathrm{K}^{\textup{MUL}}_h$ and $\mathrm{K}^{\textup{BPS}}_h$ in this ideal case. Then, for any

Figures (7)

  • Figure 1: Scheme of a dynamic HMC algorithm
  • Figure 2: Scheme of the construction of the index set sampled with $\mathbf{p}_h$, based on Hoffman and Gelman (2014)
  • Figure 3: Construction of probabilities $\mathbf{q}^{\textup{BPS}}_h$ in the example of sampling $q_0,p_0$ and $\Phiverlet[h][3](q_0,p_0)$ with $\mathbf{q}^{\textup{BPS}}_h(\cdot|\mathsf{I}-3,\Phiverlet[h][3](q_0,p_0))$.
  • Figure 4: Lebesgue densities related to the limit time distributions $\mathcal{L}_{\textup{time},T_0,\infty}^{\textup{MUL}}$ and $\mathcal{L}_{\textup{time},T_0,\infty}^{\textup{BPS}}$ under the assumption of \ref{prop:simplify_index}.
  • Figure 5: For an index set $\mathsf{I}=[-1:6]$, the multinoulli distribution $\text{Multinoulli}((q_i)_{i\in\mathsf{I}})$ with normalized weights $\sum_{i\in \mathsf{I}} q_i=1$ is split into its maximal uniform part on $\mathsf{I}^{\text{last}}$, i.e., $|\mathsf{I}^{\text{last}}|\min_{i\in \mathsf{I}^{\text{last}}} q_i \,\mathcal{U}(\mathsf{I}^{\text{last}})$ (shown in blue), and the remaining part $(1-|\mathsf{I}^{\text{last}}|\min_{i\in \mathsf{I}^{\text{last}}} q_i)\, \text{Multinoulli}((a_i)_{i\in \mathsf{I}})$ with $a_i=q_i-\mathrm{1}_{\mathsf{I}^{\text{last}}}(i)\min_{j\in \mathsf{I}^{\text{last}}} q_j$. This principle is applied to $q_i=\mathbf{q}^{\textup{BPS}}_h(i|\mathsf{I},q,p)/\sum_{j\in \mathsf{I}} \mathbf{q}^{\textup{BPS}}_h(j|\mathsf{I},q,p)$ for any $\mathsf{I}\sim \mathbf{p}_h(\cdot|q,p)$ and $(q,p)\in \mathsf{D}_\alpha\times \mathsf{E}_{\alpha,r}$.
  • ...and 2 more figures
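
The decomposition described in the Figure 5 caption can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `split_uniform_part`, the weight values, and the choice of $\mathsf{I}^{\text{last}}$ as the last four indices of $\mathsf{I}=[-1:6]$ are all hypothetical and chosen only for illustration. The terms $a_i$ are returned unnormalized, exactly as written in the caption, so they sum to the mass of the remaining part.

```python
def split_uniform_part(q, last):
    """Split normalized weights q (dict: index -> weight) into a uniform
    component on the index subset `last` and a nonnegative remainder.

    Returns (uniform_mass, a), where
      uniform_mass = |last| * min_{i in last} q_i
      a[i]         = q_i - 1_{last}(i) * min_{j in last} q_j   (unnormalized)
    """
    last = set(last)
    m = min(q[i] for i in last)              # largest constant removable on `last`
    uniform_mass = len(last) * m             # total mass of the uniform component
    a = {i: w - (m if i in last else 0.0) for i, w in q.items()}
    return uniform_mass, a


# Hypothetical example: index set I = [-1:6] with made-up normalized weights.
indices = list(range(-1, 7))
values = [0.05, 0.05, 0.10, 0.10, 0.20, 0.20, 0.15, 0.15]
q = dict(zip(indices, values))
i_last = [3, 4, 5, 6]                        # assumed choice of I^last

uniform_mass, a = split_uniform_part(q, i_last)

# Recombine: uniform_mass * U(I^last) + remainder should give back q.
recombined = {
    i: a[i] + (uniform_mass / len(i_last) if i in i_last else 0.0)
    for i in indices
}
```

With these values, the uniform component carries the mass $|\mathsf{I}^{\text{last}}|\min_{i\in\mathsf{I}^{\text{last}}} q_i$ and the entries of `a` sum to the complementary mass, so adding the two components index by index recovers the original weights.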

Theorems & Definitions (47)

  • Definition 1: From Durmus et al. (2023)
  • Proposition 1
  • proof
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Remark 1
  • Proposition 2
  • Remark 2
  • Theorem 3
  • ...and 37 more