Stochastic Gradient Langevin Dynamics with Variance Reduction

Zhishen Huang, Stephen Becker

TL;DR

This work studies stochastic gradient Langevin dynamics with variance reduction (SGLD-VR) for nonconvex optimization, combining SVRG-style gradient estimators with Gaussian noise to promote global exploration while converging to minimizers. It proves an ergodicity property, showing the method can visit broad regions of the search space, and establishes convergence guarantees to both first-order and, under a strict saddle condition, second-order stationary points. The main results include an improved time complexity to reach an $\varepsilon$-first-order stationary point and a bound for converging to an $\varepsilon$-second-order stationary point, with detailed proofs based on Lyapunov analysis, recurrence/reachability, and saddle-point escape arguments. Collectively, these contributions justify SGLD-VR as a viable global optimization tool for nonconvex empirical risk minimization problems, offering stronger exploration guarantees and faster convergence than standard SGLD variants when variance is reduced.
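The combination described above (an SVRG-style gradient estimator plus injected Gaussian noise) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm: the function names (`svrg_gradient`, `full_gradient`, `sgld_vr`, `toy_grad`), the fixed step size, and the noise scaling `noise_scale * sqrt(step)` are all hypothetical choices made here, and the toy objective is a convex quadratic chosen only so the estimator is easy to check by hand.

```python
import math
import random

def full_gradient(grad_i, n, x):
    """Average of the n component gradients at x."""
    total = [0.0] * len(x)
    for i in range(n):
        g = grad_i(i, x)
        for k in range(len(x)):
            total[k] += g[k] / n
    return total

def svrg_gradient(grad_i, x, snapshot, snapshot_grad, batch):
    """SVRG-style estimator: snapshot full gradient plus a minibatch correction."""
    est = list(snapshot_grad)
    for i in batch:
        gx = grad_i(i, x)
        gs = grad_i(i, snapshot)
        for k in range(len(x)):
            est[k] += (gx[k] - gs[k]) / len(batch)
    return est

def sgld_vr(grad_i, n, x0, step, noise_scale, epochs, epoch_len, batch_size, rng):
    """Langevin-type iteration with a variance-reduced gradient.

    Assumption: the noise term is noise_scale * sqrt(step) * N(0, I);
    the paper's actual step-size and noise schedules are not reproduced here.
    """
    x = list(x0)
    for _ in range(epochs):
        snapshot = list(x)                       # refresh the snapshot point
        snapshot_grad = full_gradient(grad_i, n, snapshot)
        for _ in range(epoch_len):
            batch = [rng.randrange(n) for _ in range(batch_size)]
            g = svrg_gradient(grad_i, x, snapshot, snapshot_grad, batch)
            x = [xk - step * gk + noise_scale * math.sqrt(step) * rng.gauss(0.0, 1.0)
                 for xk, gk in zip(x, g)]
    return x

# Toy finite sum (hypothetical): f_i(x) = 0.5 * ||x - c_i||^2, minimized at the mean of the c_i.
CENTERS = [[0.0, 0.0], [1.0, 3.0], [2.0, 0.0]]

def toy_grad(i, x):
    return [xk - ck for xk, ck in zip(x, CENTERS[i])]

if __name__ == "__main__":
    x_final = sgld_vr(toy_grad, len(CENTERS), [5.0, -5.0], step=0.1, noise_scale=0.1,
                      epochs=20, epoch_len=10, batch_size=2, rng=random.Random(0))
    print(x_final)
```

When `x` equals the snapshot, the correction term vanishes and the estimator reduces to the full gradient, which is the mechanism behind the variance reduction. Setting `noise_scale=0.0` recovers a plain SVRG-type iteration, which makes the role of the Gaussian term easy to isolate.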

Abstract

Stochastic gradient Langevin dynamics (SGLD) has gained the attention of optimization researchers due to its global optimization properties. This paper proves an improved convergence property to local minimizers of nonconvex objective functions using SGLD accelerated by variance reductions. Moreover, we prove an ergodicity property of the SGLD scheme, which gives insights on its potential to find global minimizers of nonconvex objectives.

Paper Structure

This paper contains 28 sections, 18 theorems, 82 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 1

Under Assumption as::grad_lipschitz_ld, for any $p\in(0,1)$, with probability at least $1-p$, the time complexity for the LD described in Algorithm alg::LD to converge to an $\varepsilon$-first-order stationary point $\mathbf{x}^\star$ is $\mathcal{O}\left( \frac{\Delta_f d}{ \varepsilon^2 p} \

Figures (1)

  • Figure 1: Example of training a neural net for binary classification with 2 hidden layers, $n=1000$ training points, sigmoid activation function, $\eta_0=10^3$, $\nu=1$, $\rho_0=10^{-2}$, batch size $B_b=100$, and $B_e = 10$. SGLD-VR converges to a good solution, in terms of both training and testing error, more quickly than either SGLD or regular SGD. The $\ell_2^2$ loss was used for training, but the error reported in the figure is the misclassification rate.

Theorems & Definitions (32)

  • Theorem 1
  • Remark 1
  • Theorem 2: Ergodicity
  • Definition 1
  • Theorem 3
  • Lemma 4: Bound on the variance of the SVRG gradient estimator [Reddi_SVRG_first_order_2016]
  • Lemma 5
  • Lemma 6
  • Lemma 7: Recurrence
  • Remark 2
  • ...and 22 more