Near-Optimal Streaming Heavy-Tailed Statistical Estimation with Clipped SGD

Aniket Das, Dheeraj Nagaraj, Soumyabrata Pal, Arun Suggala, Prateek Varshney

TL;DR

This work proves that the widely used Clipped-SGD algorithm attains near-optimal sub-Gaussian statistical rates whenever the second moment of the stochastic gradient noise is finite, and introduces a novel iterative refinement strategy for martingale concentration.

Abstract

We consider the problem of high-dimensional heavy-tailed statistical estimation in the streaming setting, which is much harder than the traditional batch setting due to memory constraints. We cast this problem as stochastic convex optimization with heavy-tailed stochastic gradients, and prove that the widely used Clipped-SGD algorithm attains near-optimal sub-Gaussian statistical rates whenever the second moment of the stochastic gradient noise is finite. More precisely, with $T$ samples, we show that Clipped-SGD, for smooth and strongly convex objectives, achieves an error of $\sqrt{\frac{\mathsf{Tr}(\Sigma)+\sqrt{\mathsf{Tr}(\Sigma)\|\Sigma\|_2}\log(\frac{\log(T)}{\delta})}{T}}$ with probability $1-\delta$, where $\Sigma$ is the covariance of the clipped gradient. Note that the fluctuations (depending on $\frac{1}{\delta}$) are of lower order than the term $\mathsf{Tr}(\Sigma)$. This improves upon the current best rate of $\sqrt{\frac{\mathsf{Tr}(\Sigma)\log(\frac{1}{\delta})}{T}}$ for Clipped-SGD, known only for smooth and strongly convex objectives. Our results also extend to smooth convex and Lipschitz convex objectives. Key to our result is a novel iterative refinement strategy for martingale concentration, improving upon the PAC-Bayes approach of Catoni and Giulini.
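To get a feel for how the two rates quoted in the abstract differ, the following minimal sketch evaluates both expressions numerically. It drops all constants, and the function names, the isotropic covariance $\Sigma = I_d$, and the values of $d$, $T$, and $\delta$ are illustrative assumptions, not taken from the paper.

```python
import math


def refined_rate(trace, op_norm, T, delta):
    """sqrt((Tr(S) + sqrt(Tr(S) * ||S||_2) * log(log(T) / delta)) / T), constants dropped."""
    fluctuation = math.sqrt(trace * op_norm) * math.log(math.log(T) / delta)
    return math.sqrt((trace + fluctuation) / T)


def earlier_rate(trace, T, delta):
    """sqrt(Tr(S) * log(1 / delta) / T), constants dropped."""
    return math.sqrt(trace * math.log(1.0 / delta) / T)


# Illustrative setting (an assumption): Sigma = I_d, so Tr(Sigma) = d and ||Sigma||_2 = 1.
d, T, delta = 1000, 10**6, 1e-6
new_bound = refined_rate(trace=d, op_norm=1.0, T=T, delta=delta)
old_bound = earlier_rate(trace=d, T=T, delta=delta)
print(f"refined: {new_bound:.4f}  earlier: {old_bound:.4f}")
```

In this setting the confidence-dependent term is multiplied by $\sqrt{\mathsf{Tr}(\Sigma)\|\Sigma\|_2}$ rather than $\mathsf{Tr}(\Sigma)$, which is why the refined expression comes out smaller when the dimension is large and $\delta$ is small.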

Paper Structure

This paper contains 66 sections, 40 theorems, 338 equations, 1 table, 1 algorithm.

Key Result

Theorem 1

Let the smoothness, strong convexity, and second-moment assumptions be satisfied. Then, for any $\delta \in (0, 1/2)$, the last iterate of the Clipped-SGD algorithm run for $T \gtrsim \ln(\ln(d))$ iterations with stepsize $\eta_t = \tfrac{4}{\mu (t + \gamma)}$ and clipping level $\Gamma = \ldots$, where $\gamma \asymp \max\{ \tfrac{\|\Sigma\|_2\kappa^2 \ln(\ln(T)/\delta)^2}{\mathsf{Tr}(\Sigma)}, \ldots\}$, …
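The update analysed in Theorem 1 can be sketched for the simplest case, streaming mean estimation, where the objective is $F(x) = \tfrac{1}{2}\mathbb{E}\|x - \xi\|^2$ (so $\mu = 1$) and the stochastic gradient is $x - \xi$. This is a hedged illustration only: the stepsize follows the $\eta_t = \tfrac{4}{\mu(t+\gamma)}$ schedule stated in the theorem, but the theorem's expressions for $\Gamma$ and $\gamma$ are cut off in the text above, so `clip_level` and `gamma` below are placeholder values. The function names and the Pareto noise model are likewise assumptions made for the example.

```python
import math
import random


def clip(g, threshold):
    """Rescale g so that its Euclidean norm is at most `threshold`."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= threshold:
        return list(g)
    scale = threshold / norm
    return [scale * v for v in g]


def clipped_sgd_mean(stream, dim, mu=1.0, gamma=10.0, clip_level=10.0):
    """Return the last iterate of clipped SGD on F(x) = 0.5 * E||x - xi||^2.

    `gamma` and `clip_level` are illustrative placeholders, not the
    theorem's choices. Only the current iterate is stored, so memory is O(dim).
    """
    x = [0.0] * dim
    for t, xi in enumerate(stream, start=1):
        grad = [xj - sj for xj, sj in zip(x, xi)]  # stochastic gradient x - xi
        g = clip(grad, clip_level)
        eta = 4.0 / (mu * (t + gamma))  # stepsize schedule from the theorem
        x = [xj - eta * gj for xj, gj in zip(x, g)]
    return x


def heavy_tailed_stream(mean, n, rng):
    """Yield n samples: mean plus symmetric Pareto-type noise (finite variance)."""
    for _ in range(n):
        yield [
            m + rng.choice((-1.0, 1.0)) * (rng.paretovariate(2.5) - 1.0)
            for m in mean
        ]


rng = random.Random(0)
true_mean = [1.0, -2.0, 0.5]
estimate = clipped_sgd_mean(heavy_tailed_stream(true_mean, 20000, rng), dim=3)
error = math.sqrt(sum((e - m) ** 2 for e, m in zip(estimate, true_mean)))
print("estimate:", estimate, "error:", error)
```

Each sample is consumed once and then discarded, which matches the streaming constraint described in the abstract. Clipping bounds the influence any single heavy-tailed sample can have on the iterate.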

Theorems & Definitions (53)

  • Theorem 1: Smooth Strongly Convex Objectives
  • Theorem 2: Smooth Strongly Convex Objectives with Quadratic Growth Noise Model
  • Theorem 3: Smooth Convex Objectives
  • Theorem 4: Lipschitz Convex Objectives
  • Corollary 1: Heavy Tailed Mean Estimation
  • Corollary 2: Heavy Tailed Linear Regression
  • Corollary 3: Heavy Tailed Logistic Regression
  • Corollary 4: Heavy Tailed LAD Regression
  • Theorem 5
  • Lemma 1
  • ...and 43 more