Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo

Ziyi Wang; Yujie Chen; Qifan Song; Ruqi Zhang

TL;DR

The paper addresses efficient Bayesian sampling under low-precision arithmetic by proposing and analyzing low-precision SGHMC. It develops three variants: SGHMCLP-F (full-precision gradient accumulators), SGHMCLP-L (low-precision gradient accumulators), and VC SGHMCLP-L (low-precision gradient accumulators with variance-corrected quantization). Each comes with non-asymptotic bounds in 2-Wasserstein distance, which show faster convergence and greater robustness to quantization than low-precision SGLD, especially for non-log-concave targets. The key contributions are a theoretical treatment of low-precision SGHMC across strongly log-concave and non-log-concave regimes, a variance-corrected quantizer that mitigates overdispersion, and experiments on Gaussian and Gaussian mixture distributions, MNIST, and CIFAR datasets that show practical gains in resource-constrained settings. The results suggest that low-precision SGHMC is a viable, efficient sampler for large-scale Bayesian deep learning, offering both speed and uncertainty quantification in low-precision environments.

Abstract

Low-precision training has emerged as a promising low-cost technique to enhance the training efficiency of deep neural networks without sacrificing much accuracy. Its Bayesian counterpart can further provide uncertainty quantification and improved generalization accuracy. This paper investigates low-precision sampling via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with low-precision and full-precision gradient accumulators for both strongly log-concave and non-log-concave distributions. Theoretically, our results show that, to achieve $\epsilon$-error in the 2-Wasserstein distance for non-log-concave distributions, low-precision SGHMC achieves a quadratic improvement ($\widetilde{\mathbf{O}}\left(\epsilon^{-2}{\mu^*}^{-2}\log^2\left(\epsilon^{-1}\right)\right)$) over the state-of-the-art low-precision sampler, Stochastic Gradient Langevin Dynamics (SGLD) ($\widetilde{\mathbf{O}}\left(\epsilon^{-4}{\lambda^*}^{-1}\log^5\left(\epsilon^{-1}\right)\right)$). Moreover, we prove that low-precision SGHMC is more robust to quantization error than low-precision SGLD, owing to the robustness of the momentum-based update with respect to gradient noise. Empirically, we conduct experiments on synthetic data and on the MNIST, CIFAR-10, and CIFAR-100 datasets, which validate our theoretical findings. Our study highlights the potential of low-precision SGHMC as an efficient and accurate sampling method for large-scale and resource-limited machine learning.
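To make the setting in the abstract concrete, the following is a minimal sketch of one SGHMC step with low-precision accumulators. It rests on several assumptions that this summary does not confirm: the quantizer is plain stochastic rounding to a fixed-point grid with gap `delta`, the dynamics use a simple Euler discretization with friction `gamma` and inverse mass `u`, and both position and velocity are quantized after every update. The function names `stochastic_round` and `sghmc_lp_step` are hypothetical, and this is not claimed to be the paper's exact update rule.

```python
import numpy as np


def stochastic_round(theta, delta, rng):
    """Round each entry of theta to a multiple of delta without bias.

    Assumed quantizer: a value between two grid points is rounded up with
    probability equal to its fractional position, so E[Q(theta)] = theta.
    """
    scaled = np.asarray(theta, dtype=np.float64) / delta
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part.
    round_up = rng.random(scaled.shape) < (scaled - lower)
    return delta * (lower + round_up)


def sghmc_lp_step(x, v, grad_fn, eta, gamma, u, delta, rng):
    """One illustrative SGHMC step with quantized position and velocity.

    Euler discretization (an assumption, not the paper's scheme):
        v <- v - eta*gamma*v - eta*u*grad(x) + sqrt(2*gamma*u*eta) * noise
        x <- x + eta*v
    Both states are stored on the low-precision grid after the update.
    """
    noise = rng.standard_normal(np.shape(v))
    v_new = (v - eta * gamma * v - eta * u * grad_fn(x)
             + np.sqrt(2.0 * gamma * u * eta) * noise)
    v_q = stochastic_round(v_new, delta, rng)
    x_q = stochastic_round(x + eta * v_q, delta, rng)
    return x_q, v_q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Standard Gaussian target: U(x) = ||x||^2 / 2, so grad U(x) = x.
    x, v = np.zeros(2), np.zeros(2)
    samples = []
    for _ in range(5000):
        x, v = sghmc_lp_step(x, v, lambda z: z, eta=0.05, gamma=2.0,
                             u=1.0, delta=2.0 ** -6, rng=rng)
        samples.append(x.copy())
    print("sample variance per coordinate:", np.var(samples, axis=0))
```

In this sketch the quantization noise enters the stored states directly, which is the overdispersion effect that a variance-corrected quantizer is meant to counteract; the correction itself is not implemented here.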

Paper Structure (49 sections, 25 theorems, 312 equations, 9 figures, 3 tables, 2 algorithms)

Introduction
Preliminaries
Low-Precision Quantization
Low-precision Stochastic Gradient Langevin Dynamics
Stochastic Gradient Hamiltonian Monte Carlo
Low-Precision Stochastic Gradient Hamiltonian Monte Carlo
Full-Precision Gradient Accumulators
Low-Precision Gradient Accumulators
Variance Correction
Experiments
Sampling from standard Gaussian and Gaussian mixture distributions
MNIST
CIFAR-10 & CIFAR-100
Fixed Point
Block Floating Point
...and 34 more sections

Key Result

Theorem 1

Assume that the smoothness, dissipativity, and variance assumptions hold. Let $p^*$ denote the target distribution of $({\mathbf{x}}, {\mathbf{v}})$. If $\gamma^2 \leq 4Mu$ and the step size is set to $\eta = \tilde{\mathcal{O}}\left(\frac{\mu^* \epsilon^{2}}{\log\left(1/\epsilon\right)}\right)$, then after $K$ steps starting from the initial point ${\mathbf{x}}_0={\mathbf{v}}_0=0$, the output ...

Figures (9)

Figure 1: Low-precision SGHMC on a Gaussian distribution. (a): SGHMCLP-L. (b): VC SGHMCLP-L. (c): SGHMCLP-F. VC SGHMCLP-L and SGHMCLP-F converge to the true distribution, whereas naïve SGHMCLP-L suffers a larger variance.
Figure 2: Low-precision SGHMC on a Gaussian mixture distribution. (a): SGHMCLP-L. (b): VC SGHMCLP-L. (c): SGHMCLP-F. VC SGHMCLP-L and SGHMCLP-F converge to the true distribution, whereas naïve SGHMCLP-L suffers a larger variance.
Figure 3: Log $L_2$ distance between the density estimated from samples of low-precision SGHMC and SGLD and the Gaussian mixture distribution. (a): Low-precision gradient accumulators. (b): Full-precision gradient accumulators. Overall, SGHMC methods enjoy a faster convergence speed. In particular, SGHMCLP-L achieves a lower distance compared to SGLDLP-L and VC SGLDLP-L.
Figure 4: Mean (dotted line) and 95% confidence interval (shaded area) of the 2-Wasserstein error ratio between VC SGHMCLP-L and SGHMCLP-L, computed over 5 experimental runs (smaller means the variance correction is more effective). The x-axis represents the ratio between $\mathrm{Var}_{\mathbf{x}}^{hmc}$ and $\Delta^2/4$.
Figure 5: Training NLL of low-precision SGHMC and SGLD for a logistic model on MNIST with different numbers of fractional bits. (a): Full-precision gradient accumulators. (b): Low-precision gradient accumulators. (c): Variance-corrected quantizer. SGHMCLP-F achieves results comparable to SGLDLP-F. However, both SGHMCLP-L and VC SGHMCLP-L are more robust to quantization error, especially when the number of representable bits is low. Note that the y-axis scales differ across the three panels.
...and 4 more figures
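
The $\Delta^2/4$ scale in the Figure 4 caption has a simple interpretation if one assumes that $\Delta$ is the quantization gap and that the quantizer $Q$ is stochastic rounding; this summary defines neither, so both are assumptions. For $\theta \in [k\Delta, (k+1)\Delta]$, write $p = (\theta - k\Delta)/\Delta$. Then:

$$
Q(\theta) =
\begin{cases}
(k+1)\Delta & \text{with probability } p, \\
k\Delta & \text{with probability } 1 - p,
\end{cases}
\qquad
\mathbb{E}[Q(\theta)] = \theta,
\qquad
\mathrm{Var}[Q(\theta)] = \Delta^2\, p(1-p) \leq \frac{\Delta^2}{4}.
$$

Under these assumptions, $\Delta^2/4$ is the largest variance that a single rounding operation can inject, attained at $p = 1/2$, which makes it a natural reference against which to compare the sampler's own per-step variance.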

Theorems & Definitions (36)

Definition 1
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Theorem 6
Theorem 7
Theorem 8
Theorem 9
...and 26 more
