Lower Bounds for Non-Convex Stochastic Optimization

Yossi Arjevani; Yair Carmon; John C. Duchi; Dylan J. Foster; Nathan Srebro; Blake Woodworth

TL;DR

This work establishes tight lower bounds for stochastic first-order methods in non-convex optimization. Finding an $\epsilon$-stationary point requires $\Omega(\Delta L \sigma^{2} / \epsilon^{4})$ queries when the oracle has bounded variance, and $\Omega(\Delta \bar{L} \sigma / \epsilon^{3} + \sigma^{2}/\epsilon^{2})$ queries when it additionally satisfies mean-squared smoothness (MSS); the hard instances have dimension scaling polynomially in $1/\epsilon$. Using probabilistic zero-chains and random rotations, the authors prove that SGD is minimax-optimal in the bounded-variance setting and that variance-reduction methods are optimal under MSS, which clarifies the separation between the MSS and non-MSS regimes. The results extend to learning-type and active oracle models, as well as finite-sum structures, and imply a separation between non-convex and convex stochastic optimization in terms of the $\epsilon^{-4}$ vs $\epsilon^{-2}$ scaling. The paper also outlines several open questions, including the MSS bound with a single query ($K=1$), stronger oracle assumptions, and extensions to higher-order algorithms.

Abstract

We lower bound the complexity of finding $\epsilon$-stationary points (with gradient norm at most $\epsilon$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $\epsilon^{-4}$ queries to find an $\epsilon$-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $\epsilon^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.
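To get a feel for how far apart the two regimes are, the rates quoted in the TL;DR can be evaluated numerically. The sketch below is only an illustration: it drops the unspecified constants hidden in the $\Omega(\cdot)$ notation, and the function names and parameter values are assumptions made for this example, not anything defined in the paper.

```python
def bounded_variance_rate(delta, L, sigma, eps):
    """Delta * L * sigma^2 / eps^4, with the Omega(.) constant dropped."""
    return delta * L * sigma**2 / eps**4


def mss_rate(delta, L_bar, sigma, eps):
    """Delta * L_bar * sigma / eps^3 + sigma^2 / eps^2, constant dropped."""
    return delta * L_bar * sigma / eps**3 + sigma**2 / eps**2


if __name__ == "__main__":
    # Hypothetical problem parameters, all set to 1 for readability.
    delta, L, L_bar, sigma = 1.0, 1.0, 1.0, 1.0
    for eps in (0.5, 0.1, 0.01):
        bv = bounded_variance_rate(delta, L, sigma, eps)
        ms = mss_rate(delta, L_bar, sigma, eps)
        print(f"eps={eps:g}  bounded-variance ~ {bv:.3g}  MSS ~ {ms:.3g}")
```

With all parameters equal to 1, halving $\epsilon$ multiplies the bounded-variance rate by 16, while the leading term of the MSS rate grows by a factor of 8. These are scalings of the lower bounds only, not predictions of the query count of any particular algorithm.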
