Limitations of SGD for Multi-Index Models Beyond Statistical Queries

Daniel Barzilai, Ohad Shamir

TL;DR

This work provides a non-SQ framework for understanding when vanilla SGD fails on high-dimensional single-index and multi-index target functions, focusing on the alignment between the learned subspace and task directions. By introducing a gradient condition number and an alignment-based analysis, it yields general lower bounds that apply to broad architectures, including networks with a linear first layer, and Gaussian-input settings. The paper derives concrete lower bounds via Hermite expansions (information exponent) and demonstrates SGD hardness for periodic-like and multi-index targets, even without SQ-adversarial noise. It also critically discusses the limitations of SQ-based explanations and connects theoretical findings to practical concerns in gradient-based learning, offering open questions about extending these results to other gradient methods and more complex architectures.

Abstract

Understanding the limitations of gradient methods, and stochastic gradient descent (SGD) in particular, is a central challenge in learning theory. To that end, a commonly used tool is the Statistical Queries (SQ) framework, which studies performance limits of algorithms based on noisy interaction with the data. However, it is known that the formal connection between the SQ framework and SGD is tenuous: Existing results typically rely on adversarial or specially-structured gradient noise that does not reflect the noise in standard SGD, and (as we point out here) can sometimes lead to incorrect predictions. Moreover, many analyses of SGD for challenging problems rely on non-trivial algorithmic modifications, such as restricting the SGD trajectory to the sphere or using very small learning rates. To address these shortcomings, we develop a new, non-SQ framework to study the limitations of standard vanilla SGD, for single-index and multi-index models (namely, when the target function depends on a low-dimensional projection of the inputs). Our results apply to a broad class of settings and architectures, including (potentially deep) neural networks.

Paper Structure (45 sections, 34 theorems, 188 equations)

Key Result

Theorem 2

Under Assumptions ass: inputs_all-ass: bounded, let $\delta > 0$, $\bar{\kappa}\geq 1$, and let $\psi:[0, 1]\to [0,\infty)$ be an increasing function such that $\left\|\nabla_{W} \mathcal{L}(\theta)\right\|_F \leq \psi\left(\left\|P_W P_U\right\|_\mathrm{op}\right)$ for all $\theta$. There exist con then conditioned on $\kappa_{ T} \leq \bar{\kappa}$ it holds with probability at least $1-\delta$ t
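
The alignment quantity $\left\|P_W P_U\right\|_\mathrm{op}$ in the statement above can be computed numerically. The following is a minimal sketch, not the authors' code. It assumes that $P_W$ is the orthogonal projection onto the row space of a first-layer weight matrix `W` of shape `(m, d)`, and that $P_U$ is the orthogonal projection onto the span of the columns of a `(d, k)` matrix `U` of target directions. Both conventions, and the names `orthonormal_basis` and `alignment`, are assumptions made for illustration.

```python
import numpy as np


def orthonormal_basis(A, tol=1e-10):
    """Return a matrix whose columns are an orthonormal basis of col(A)."""
    left, s, _ = np.linalg.svd(A, full_matrices=False)
    if s.size == 0 or s.max() == 0.0:
        return left[:, :0]  # zero matrix: empty basis
    rank = int(np.sum(s > tol * s.max()))
    return left[:, :rank]


def alignment(W, U):
    """Compute ||P_W P_U||_op.

    Assumed conventions (hypothetical, not taken from the paper):
      W has shape (m, d); P_W projects onto the row space of W.
      U has shape (d, k); P_U projects onto the column space of U.
    """
    Qw = orthonormal_basis(W.T)  # (d, rank of W)
    Qu = orthonormal_basis(U)    # (d, rank of U)
    if Qw.shape[1] == 0 or Qu.shape[1] == 0:
        return 0.0
    # P_W P_U = Qw (Qw^T Qu) Qu^T, so its operator norm equals the largest
    # singular value of Qw^T Qu (cosine of the smallest principal angle).
    return float(np.linalg.norm(Qw.T @ Qu, ord=2))


# Usage: alignment of a random first layer with a single target direction.
rng = np.random.default_rng(0)
d, m = 1000, 5
W_init = rng.standard_normal((m, d))
u = np.zeros((d, 1))
u[0, 0] = 1.0
initial_alignment = alignment(W_init, u)
```

The returned value always lies in $[0, 1]$: it is 1 when the two subspaces share a direction and 0 when they are orthogonal. For a random Gaussian `W_init` in high dimension, it is typically small, on the order of $\sqrt{m/d}$.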

Theorems & Definitions (38)

  • Definition 1
  • Theorem 2
  • Theorem 3
  • Definition 4: Information exponent
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Remark 8
  • Lemma 9
  • Definition 10
  • ...and 28 more
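
Definition 4 in the list above concerns the information exponent, which the TL;DR ties to Hermite expansions. The text of the paper's definition is not reproduced here, so the sketch below uses the common single-index convention as an assumption: the information exponent of $f$ under a standard Gaussian input is the smallest $k \geq 1$ whose normalized Hermite coefficient $\mathbb{E}[f(z)\,\mathrm{He}_k(z)]/\sqrt{k!}$ is nonzero. The function names, the integration grid, and the tolerance `tol` are illustrative choices, and the paper's definition may differ (for example, in the multi-index case).

```python
import math


def hermite_coefficients(f, max_degree, half_width=10.0, n_points=20001):
    """Approximate c_k = E[f(z) He_k(z)] / sqrt(k!) for z ~ N(0, 1).

    Uses the trapezoid rule on [-half_width, half_width] and the
    probabilists' Hermite recurrence He_{k+1}(x) = x He_k(x) - k He_{k-1}(x).
    """
    h = 2.0 * half_width / (n_points - 1)
    sums = [0.0] * (max_degree + 1)
    for i in range(n_points):
        x = -half_width + i * h
        weight = h * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        if i == 0 or i == n_points - 1:
            weight *= 0.5  # trapezoid end-point weights
        fx = f(x)
        he_prev, he = 0.0, 1.0  # placeholder for He_{-1}, and He_0
        for k in range(max_degree + 1):
            sums[k] += weight * fx * he
            he_prev, he = he, x * he - k * he_prev
    return [s / math.sqrt(math.factorial(k)) for k, s in enumerate(sums)]


def information_exponent(f, max_degree=10, tol=1e-6):
    """Smallest k >= 1 with |c_k| > tol, or None if none up to max_degree."""
    coeffs = hermite_coefficients(f, max_degree)
    for k in range(1, max_degree + 1):
        if abs(coeffs[k]) > tol:
            return k
    return None


# Usage: tanh has a nonzero linear Hermite component, while the third
# Hermite polynomial x^3 - 3x is orthogonal to all lower-degree ones.
ie_tanh = information_exponent(math.tanh)
ie_he3 = information_exponent(lambda x: x**3 - 3.0 * x)
```

Because the coefficients are computed numerically, "nonzero" is decided by the threshold `tol`; a target whose low-order coefficients are tiny but not exactly zero would need a tolerance chosen with that in mind.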