Lower Bounds for Non-Convex Stochastic Optimization

Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, Blake Woodworth

TL;DR

This work establishes tight distributional lower bounds for stochastic first-order methods in non-convex optimization. Finding an $\epsilon$-stationary point requires at least $\Omega(\Delta L \sigma^{2} / \epsilon^{4})$ queries under bounded-variance oracles and at least $\Omega(\Delta \bar{L} \sigma / \epsilon^{3} + \sigma^{2}/\epsilon^{2})$ queries under mean-squared smoothness (MSS), on hard instances whose dimension scales polynomially in $1/\epsilon$. Using probabilistic zero-chains and random rotations, the authors prove that SGD is minimax-optimal in the bounded-variance setting and that variance-reduction methods are optimal under MSS, which clarifies the fundamental limits of, and the separation between, the MSS and non-MSS regimes. The results extend to learning-type and active oracle models, as well as to finite-sum structures, and imply a separation between non-convex and convex stochastic optimization in terms of the $\epsilon^{-4}$ vs $\epsilon^{-2}$ scaling. The paper also outlines several open questions, including the MSS bound with a single query ($K=1$), stronger oracle assumptions, and extensions to higher-order algorithms.

Abstract

We lower bound the complexity of finding $\epsilon$-stationary points (with gradient norm at most $\epsilon$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $\epsilon^{-4}$ queries to find an $\epsilon$-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $\epsilon^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.
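
To get a feel for how the two rates quoted in the TL;DR compare, the following minimal sketch evaluates them numerically with all hidden constants dropped. The function names and the parameter values are illustrative assumptions, not taken from the paper, so the printed numbers indicate scaling only and are not actual query counts.

```python
def bounded_variance_rate(delta, L, sigma, eps):
    """Delta * L * sigma^2 / eps^4, i.e. the bounded-variance rate
    with the constant hidden in Omega(.) dropped (assumption)."""
    return delta * L * sigma**2 / eps**4


def mss_rate(delta, L_bar, sigma, eps):
    """Delta * L_bar * sigma / eps^3 + sigma^2 / eps^2, i.e. the
    mean-squared-smoothness rate with hidden constants dropped (assumption)."""
    return delta * L_bar * sigma / eps**3 + sigma**2 / eps**2


# Illustrative parameter values only: Delta = L = L_bar = sigma = 1.
for eps in (1e-1, 1e-2, 1e-3):
    bv = bounded_variance_rate(1.0, 1.0, 1.0, eps)
    ms = mss_rate(1.0, 1.0, 1.0, eps)
    print(f"eps={eps:g}  bounded-variance ~ {bv:.3g}  MSS ~ {ms:.3g}  ratio ~ {bv / ms:.3g}")
```

With unit parameters the ratio of the two expressions is $1/(\epsilon(1+\epsilon))$, so the gap between the regimes widens roughly like $1/\epsilon$ as the target accuracy tightens.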

Paper Structure

This paper contains 40 sections, 22 theorems, 191 equations, 1 figure.

Key Result

Lemma 1

Let $g(x,z)$ be a probability-$p$ zero-chain gradient estimator for $F:\mathbb{R}^{T}\to\mathbb{R}$, and let $\mathsf{O}$ be any oracle with $\mathsf{O}_{F}(x,z)=(F(x),g(x,z))$. Let $\{x^{(t,k)}_{\mathsf{A}[\mathsf{O}_F]}\}$ be the queries of any $\mathsf{A}\in\mathcal{A}_{\textnormal{zr}}(K)$ i...
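
The intuition behind a probability-$p$ zero-chain can be illustrated with a toy simulation. This is a deliberately simplified model and not the paper's construction: it assumes that each oracle round reveals the next coordinate with probability $p$ and otherwise reveals nothing, and it ignores the number of queries per round. The names `rounds_to_full_progress` and `mean_rounds` are hypothetical.

```python
import random


def rounds_to_full_progress(T, p, rng):
    """Count rounds until all T coordinates are revealed in the toy model,
    where each round reveals the next coordinate with probability p."""
    progress = 0
    rounds = 0
    while progress < T:
        rounds += 1
        if rng.random() < p:  # the noisy oracle "unlocks" one more coordinate
            progress += 1
    return rounds


def mean_rounds(T, p, trials=2000, seed=0):
    """Average the round count over independent trials with a fixed seed."""
    rng = random.Random(seed)
    return sum(rounds_to_full_progress(T, p, rng) for _ in range(trials)) / trials


# Illustrative values: a chain of length 20 with reveal probability 0.1.
print(mean_rounds(T=20, p=0.1))
```

In this toy model the round count follows a negative binomial distribution with mean $T/p$, which conveys why lowering the reveal probability $p$ inflates the number of rounds needed to activate all $T$ coordinates.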

Figures (1)

  • Figure 1: The construction $\Gamma$ in Eq. eq:theshfunc-def and its derivatives; obs:thresh is evident.

Theorems & Definitions (39)

  • Definition 1
  • Definition 2
  • Lemma 1
  • proof
  • Lemma 2: carmon2019lower_i
  • Lemma 3
  • proof
  • Theorem 1
  • proof : Proof of thm:zr_population
  • Lemma 4
  • ...and 29 more