Ito Diffusion Approximation of Universal Ito Chains for Sampling, Optimization and Boosting
Aleksei Ustimenko, Aleksandr Beznosikov
TL;DR
The authors study diffusion approximations for a very general Ito chain that allows non-Gaussian, state-dependent noise and inexact drift and diffusion terms, unifying the analysis across sampling, optimization, and boosting. They develop a multi-step approach (window coupling, interpolation, covariance-corrected interpolation, and entropy-based diffusion comparisons via Girsanov theory) to bound the Wasserstein-2 distance between the discrete chain and its diffusion limit. The resulting rates, expressed as $\mathcal{W}_2(\mathcal{L}(X_{k}),\mathcal{L}(Z_{k\eta}))=\mathcal{O}\big((1+(k\eta)^{1/2})e^{\mathcal{O}(k\eta)}\eta^{\theta}+ (k\eta)^{1/4}e^{\mathcal{O}(k\eta)}\eta^{\theta/2+\gamma/4}\big)$ with $\theta=\min\{\alpha, ((\gamma+1)(1+\chi_0)+(\gamma+\beta)(1-\chi_0))/4\}$, cover a broad range of settings, including SGD with Gaussian or non-Gaussian noise, where $\theta$ recovers known rates (e.g., $\theta=1$ for certain SGD/SGLD cases). This work extends diffusion-approximation theory beyond dissipative/convex regimes and provides guarantees for sampling and optimization algorithms operating under general, potentially non-Gaussian noise.
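To make the exponent $\theta$ in the rate above concrete, the following minimal Python sketch evaluates the stated formula for given numeric parameters. The function name `theta_exponent` is hypothetical, the parameters are treated as plain numbers (their meaning is defined in the paper, not here), and the sample values are purely illustrative rather than taken from any specific setting.

```python
def theta_exponent(alpha: float, beta: float, gamma: float, chi0: float) -> float:
    """Evaluate theta = min{alpha, ((gamma+1)(1+chi0) + (gamma+beta)(1-chi0)) / 4}.

    Hypothetical helper: it only transcribes the formula quoted in the summary.
    """
    second = ((gamma + 1.0) * (1.0 + chi0) + (gamma + beta) * (1.0 - chi0)) / 4.0
    return min(alpha, second)


# Illustrative values only (an assumption, not a setting from the paper).
example_theta = theta_exponent(alpha=1.0, beta=1.0, gamma=1.0, chi0=1.0)
```

A larger $\theta$ means the bound shrinks faster as the step size $\eta$ decreases, since the leading term scales as $\eta^{\theta}$.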
Abstract
In this work, we consider a rather general and broad class of Markov chains, Ito chains, that look like the Euler-Maruyama discretization of some stochastic differential equation. The chain we study is a unified framework for theoretical analysis. It comes with almost arbitrary isotropic and state-dependent noise instead of the normal and state-independent noise used in most related papers. Moreover, in our chain the drift and diffusion coefficient can be inexact in order to cover a wide range of applications such as Stochastic Gradient Langevin Dynamics, sampling, Stochastic Gradient Descent, or Stochastic Gradient Boosting. We prove a bound in $W_{2}$-distance between the laws of our Ito chain and the corresponding differential equation. These results improve or cover most of the known estimates. For some particular cases, our analysis is the first.
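As a rough illustration of a chain that "looks like" an Euler-Maruyama discretization but uses non-Gaussian noise, the sketch below simulates a one-dimensional recursion $X_{k+1} = X_k + \eta\, b(X_k) + \sqrt{\eta}\, \sigma(X_k)\, \varepsilon_k$. This is a simplified assumption: it omits the inexact drift and diffusion terms of the general chain, and the names `ito_chain`, `rademacher`, and `gaussian`, as well as the particular drift and diffusion functions, are hypothetical choices made for illustration.

```python
import math
import random
from typing import Callable, List


def ito_chain(
    x0: float,
    drift: Callable[[float], float],
    diffusion: Callable[[float], float],
    noise: Callable[[random.Random], float],
    eta: float,
    n_steps: int,
    rng: random.Random,
) -> List[float]:
    """Simulate X_{k+1} = X_k + eta*b(X_k) + sqrt(eta)*sigma(X_k)*eps_k in 1-D."""
    x = x0
    path = [x]
    for _ in range(n_steps):
        eps = noise(rng)  # zero-mean, unit-variance; need not be Gaussian
        x = x + eta * drift(x) + math.sqrt(eta) * diffusion(x) * eps
        path.append(x)
    return path


def rademacher(rng: random.Random) -> float:
    """Non-Gaussian noise: +1 or -1 with equal probability."""
    return 1.0 if rng.random() < 0.5 else -1.0


def gaussian(rng: random.Random) -> float:
    """Standard normal noise, as in the classical Euler-Maruyama scheme."""
    return rng.gauss(0.0, 1.0)


# Illustrative coefficients (assumptions): mean-reverting drift and a
# state-dependent diffusion coefficient.
def drift(x: float) -> float:
    return -x


def diffusion(x: float) -> float:
    return 1.0 / (1.0 + x * x)


path_rademacher = ito_chain(1.0, drift, diffusion, rademacher, 0.01, 1000, random.Random(0))
path_gaussian = ito_chain(1.0, drift, diffusion, gaussian, 0.01, 1000, random.Random(0))
```

Running both paths with a small step size `eta` gives two chains driven by different noise laws but sharing the same drift and diffusion coefficients, which is the kind of setting in which a common diffusion limit is compared against the discrete chain.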
