Limitations of SGD for Multi-Index Models Beyond Statistical Queries
Daniel Barzilai, Ohad Shamir
TL;DR
This work develops a non-SQ framework for understanding when vanilla SGD fails on high-dimensional single-index and multi-index target functions, focusing on the alignment between the learned subspace and the task directions. Using a gradient condition number and an alignment-based analysis, it obtains general lower bounds that apply to broad classes of architectures (including networks with a linear first layer) and to Gaussian-input settings. The paper derives concrete lower bounds via Hermite expansions (the information exponent) and demonstrates that SGD is hard for periodic-like and multi-index targets, even without the adversarial noise on which SQ-based arguments rely. It also critically discusses the limitations of SQ-based explanations, connects the theoretical findings to practical concerns in gradient-based learning, and poses open questions about extending the results to other gradient methods and more complex architectures.
Abstract
Understanding the limitations of gradient methods, and stochastic gradient descent (SGD) in particular, is a central challenge in learning theory. To that end, a commonly used tool is the Statistical Queries (SQ) framework, which studies performance limits of algorithms based on noisy interaction with the data. However, it is known that the formal connection between the SQ framework and SGD is tenuous: existing results typically rely on adversarial or specially-structured gradient noise that does not reflect the noise in standard SGD, and (as we point out here) can sometimes lead to incorrect predictions. Moreover, many analyses of SGD for challenging problems rely on non-trivial algorithmic modifications, such as restricting the SGD trajectory to the sphere or using very small learning rates. To address these shortcomings, we develop a new, non-SQ framework to study the limitations of vanilla SGD for single-index and multi-index models (namely, when the target function depends on a low-dimensional projection of the inputs). Our results apply to a broad class of settings and architectures, including (potentially deep) neural networks.
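
For concreteness, the setting named above can be written out as follows. This is only a sketch in commonly used notation, not notation taken from the paper: the symbols d (input dimension), k (number of task directions), U (the projection), g (the link function), and c_j (Hermite coefficients) are all assumptions introduced here for illustration.

```latex
% Multi-index model: the target depends on x only through a
% low-dimensional projection Ux (symbols are illustrative assumptions).
f^*(x) = g(Ux), \qquad x \in \mathbb{R}^d,\quad U \in \mathbb{R}^{k \times d},\quad k \ll d .

% Single-index model: the special case k = 1, with a single direction u.
f^*(x) = g(\langle u, x \rangle), \qquad u \in \mathbb{R}^d .

% Information exponent (usual convention, for standard Gaussian inputs):
% expand g in normalized Hermite polynomials h_j and take the index of
% the first non-vanishing coefficient of degree at least 1.
g(z) = \sum_{j \ge 0} c_j \, h_j(z), \qquad
c_j = \mathbb{E}_{z \sim \mathcal{N}(0,1)}\big[ g(z)\, h_j(z) \big], \qquad
\text{information exponent} = \min \{\, j \ge 1 : c_j \neq 0 \,\} .
```

Under this convention a larger information exponent means that the low-degree Hermite components of the target carry no signal about the direction u, which is the usual intuition for why gradient-based learning becomes harder in such cases.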
