Improved high-dimensional estimation with Langevin dynamics and stochastic weight averaging

Stanley Wei; Alex Damian; Jason D. Lee

TL;DR

The key idea is that the combination of noise injection and iterate averaging emulates the effect of landscape smoothing. The authors also conjecture that minibatch SGD can achieve the same rate without any added noise.

Abstract

Significant recent work has studied the ability of gradient descent to recover a hidden planted direction $\theta^\star \in S^{d-1}$ in different high-dimensional settings, including tensor PCA and single-index models. The key quantity that governs the ability of gradient descent to traverse these landscapes is the information exponent $k^\star$ (Ben Arous et al., 2021), which corresponds to the order of the saddle at initialization in the population landscape. Ben Arous et al. (2021) showed that $n \gtrsim d^{\max(1, k^\star-1)}$ samples were necessary and sufficient for online SGD to recover $\theta^\star$, and Ben Arous et al. (2020) proved a similar lower bound for Langevin dynamics. More recently, Damian et al. (2023) showed it was possible to circumvent these lower bounds by running gradient descent on a smoothed landscape, and that this algorithm succeeds with $n \gtrsim d^{\max(1, k^\star/2)}$ samples, which is optimal in the worst case. This raises the question of whether it is possible to achieve the same rate without explicit smoothing. In this paper, we show that Langevin dynamics can succeed with $n \gtrsim d^{k^\star/2}$ samples if one considers the average iterate, rather than the last iterate. The key idea is that the combination of noise injection and iterate averaging is able to emulate the effect of landscape smoothing. We apply this result to both the tensor PCA and single-index model settings. Finally, we conjecture that minibatch SGD can also achieve the same rate without adding any additional noise.
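As a rough illustration of the gap between these rates, take $k^\star = 4$ and $d = 100$ (values chosen here only for the arithmetic), and ignore the constants and logarithmic factors that the $\gtrsim$ notation suppresses:

$$
d^{\max(1,\, k^\star - 1)} = 100^{3} = 10^{6},
\qquad
d^{k^\star/2} = 100^{2} = 10^{4}.
$$

Under these assumptions the online SGD rate asks for on the order of a hundred times more samples than the rate obtained with smoothing or with iterate averaging. The ratio between the two rates is $d^{k^\star/2 - 1}$ whenever $k^\star \ge 2$, so the gap widens as $d$ grows.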

Paper Structure (30 sections, 52 theorems, 205 equations, 2 figures, 1 algorithm)

Introduction
Setup and Main Contributions
Notation
Setting
Tensor PCA
Single-Index Models
The Learning Algorithm
Main Contributions
Main Results
Overview of Proof Ideas
Ergodic Concentration
Analyzing the Error Component $E$
Recovery of $\theta^\star$
Discussion
Experiments
...and 15 more sections

Key Result

Theorem 1

Consider a link function $\sigma$ with information exponent $k^\star$. Then, given $n\gtrsim d^{\lceil k^\star/2 \rceil}$ samples drawn i.i.d. from the standard $d$-dimensional Gaussian, running the training algorithm (Algorithm 1) recovers the ground truth direction $\theta^\star$.
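To make the information exponent concrete, the following is a minimal sketch that estimates it numerically for a given link function. It assumes the standard definition (the smallest $k \ge 1$ whose Hermite coefficient is nonzero) and uses unnormalized coefficients $\mathbb{E}[\sigma(z)\,He_k(z)]$ with $z \sim N(0,1)$; the paper may normalize them differently, which does not change which coefficient is the first nonzero one. The function names, the degree cutoff, and the tolerance are illustrative choices, not taken from the paper.

```python
import numpy as np
from numpy.polynomial import hermite_e as He


def hermite_coefficients(sigma, max_degree=8, quad_points=64):
    """Approximate E[sigma(z) * He_k(z)] for k = 0..max_degree, z ~ N(0, 1)."""
    # Gauss-HermiteE quadrature uses the weight exp(-x^2 / 2);
    # dividing by sqrt(2*pi) turns the weighted sum into a Gaussian expectation.
    x, w = He.hermegauss(quad_points)
    w = w / np.sqrt(2.0 * np.pi)
    values = sigma(x)
    coeffs = np.empty(max_degree + 1)
    for k in range(max_degree + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0  # selects the probabilist's Hermite polynomial He_k
        coeffs[k] = np.sum(w * values * He.hermeval(x, basis))
    return coeffs


def information_exponent(sigma, max_degree=8, tol=1e-8):
    """Smallest k >= 1 with a nonzero Hermite coefficient, or None if none found."""
    coeffs = hermite_coefficients(sigma, max_degree)
    for k in range(1, max_degree + 1):
        if abs(coeffs[k]) > tol:
            return k
    return None


if __name__ == "__main__":
    # He_3(x) = x^3 - 3x is orthogonal to He_1 and He_2, so its exponent is 3.
    print(information_exponent(lambda x: x**3 - 3 * x))
```

The tolerance is needed because quadrature returns values that are only numerically zero. For a link function that is not a low-degree polynomial, the result depends on the cutoff and the number of quadrature points, so it should be treated as an estimate.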

Figures (2)

Figure 1: We run the training algorithm with $d=100$ and $n=10d^{\lceil k^\star/2 \rceil}$ samples, using various learning rates. The dark curves show the correlation of the time-averaged iterate with $\theta^\star$ as a function of iteration; the average converges to the direction of $\theta^\star$. The light curves show the correlation of the actual iterate, which stays near the equator over the entire training process.
Figure 2: Simulations for $k^\star=4$, run with $d=100$ and $n=10d^2$ samples.
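The distinction the figures draw between the actual iterate and its time average can be sketched in code as follows. This is an illustrative Euler-Maruyama discretization of Langevin dynamics constrained to the unit sphere, with a running average of the iterates; it is not the paper's algorithm. The function name, the projection-and-renormalize step, and the parameters `lr` and `inv_temp` are assumptions made for this example, and the toy gradient in the demo is a hypothetical population-style loss rather than an empirical one.

```python
import numpy as np


def langevin_sphere_average(grad_fn, theta0, steps, lr, inv_temp, rng):
    """Noisy gradient steps on the unit sphere; returns (last iterate, averaged direction)."""
    theta = theta0 / np.linalg.norm(theta0)
    running_mean = np.zeros_like(theta)
    for t in range(1, steps + 1):
        noise = rng.standard_normal(theta.shape[0])
        # Langevin step: gradient drift plus Gaussian noise of scale sqrt(2 * lr / inv_temp).
        step = -lr * grad_fn(theta) + np.sqrt(2.0 * lr / inv_temp) * noise
        # Keep only the component tangent to the sphere, then renormalize.
        step -= np.dot(step, theta) * theta
        theta = theta + step
        theta /= np.linalg.norm(theta)
        # Incremental update of the average of all iterates so far.
        running_mean += (theta - running_mean) / t
    return theta, running_mean / np.linalg.norm(running_mean)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 50
    theta_star = np.zeros(d)
    theta_star[0] = 1.0

    # Hypothetical toy loss L(theta) = -<theta, theta_star>^3, used only to exercise the loop.
    def grad_fn(theta):
        return -3.0 * np.dot(theta, theta_star) ** 2 * theta_star

    last, avg_dir = langevin_sphere_average(
        grad_fn, rng.standard_normal(d), steps=1000, lr=0.01, inv_temp=float(d), rng=rng
    )
    print("last-iterate correlation:", float(np.dot(last, theta_star)))
    print("averaged-iterate correlation:", float(np.dot(avg_dir, theta_star)))
```

The estimator of interest is the normalized average, not `theta` itself: each iterate is dominated by injected noise, while averaging cancels much of that noise. Whether the average actually aligns with the planted direction depends on the loss, the sample size, and the step and temperature settings, none of which this sketch tunes.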

Theorems & Definitions (98)

Theorem 1: Main theorem (informal)
Definition 1: Probabilist's Hermite polynomials
Definition 2: Hermite coefficients
Definition 3: Information exponent
Example 1
Definition 4: Spherical gradient operator
Theorem 2: Odd $k^\star$
Corollary 1
Theorem 3: Even $k^\star$
Definition 5: Markov semigroup
...and 88 more
