All ERMs Can Fail in Stochastic Convex Optimization Lower Bounds in Linear Dimension

Tal Burla, Roi Livni

TL;DR

This work investigates generalization in overparameterized stochastic convex optimization by constructing a linear-dimension instance where the best-case ERM is unique yet overfits, resolving Feldman’s open question. It extends the analysis to approximate ERMs and derives a new lower bound for generalization under Gradient Descent, showing that constrained GD can overfit when horizon and learning rate scale with sample size, with a lower bound of $\Omega\left(\sqrt{\eta T/m^{1.5}}\right)$. The paper also establishes that every $\Theta(m^{-3/2})$-ERM incurs a constant population excess risk, and presents a tight coupling between ERM behavior and GD dynamics across regimes, thereby highlighting the limits of ERMs and implicit biases in explaining learnability in SCO. The results sharpen the understanding of learnability in convex settings and point to open questions about extensions to smooth objectives and intermediate-accuracy ERMs.
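
To make the roles of the learning rate $\eta$, the horizon $T$ and the sample size $m$ concrete, here is a minimal sketch of constrained (projected) Gradient Descent on an empirical risk. It is not the paper's construction: the unit-ball domain, the zero initialization, returning the last iterate, and the toy linear loss $f(w,z) = -\langle w, z\rangle$ are all assumptions made for illustration, and the function names are hypothetical.

```python
import math


def project_to_unit_ball(w):
    """Euclidean projection onto {w : ||w||_2 <= 1} (assumed domain)."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= 1.0:
        return list(w)
    return [x / norm for x in w]


def neg_inner_product_subgradient(w, z):
    """Gradient in w of the toy loss f(w, z) = -<w, z>.

    The loss is convex and 1-Lipschitz whenever ||z||_2 <= 1.
    It is an illustrative stand-in, not the loss from the paper.
    """
    return [-zi for zi in z]


def projected_gd(sample, subgradient, eta, T):
    """Run T projected (sub)gradient steps on the empirical risk.

    sample      : list of m data points (each a list of floats)
    subgradient : function (w, z) -> subgradient of f(., z) at w
    eta         : learning rate
    T           : horizon (number of steps)
    Returns the last iterate (an assumption; averaging is also common).
    """
    d = len(sample[0])
    m = len(sample)
    w = [0.0] * d  # assumed initialization at the origin
    for _ in range(T):
        # Average the per-example subgradients over the m samples.
        g = [0.0] * d
        for z in sample:
            gz = subgradient(w, z)
            for i in range(d):
                g[i] += gz[i] / m
        # Gradient step followed by projection back onto the domain.
        w = project_to_unit_ball([w[i] - eta * g[i] for i in range(d)])
    return w


if __name__ == "__main__":
    data = [[1.0, 0.0], [1.0, 0.0]]
    print(projected_gd(data, neg_inner_product_subgradient, eta=0.5, T=10))
```

The regime discussed in the summary corresponds to choosing `eta` and `T` as growing functions of `len(sample)`; the sketch only fixes the shape of the algorithm, not those choices.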

Abstract

We study the sample complexity of the best-case Empirical Risk Minimizer in the setting of stochastic convex optimization. We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, but the Empirical Risk Minimizer is likely to be unique and to overfit. This resolves an open question by Feldman. We also extend this to approximate ERMs. Building on our construction, we also show that (constrained) Gradient Descent potentially overfits when the horizon and learning rate grow with respect to the sample size. Specifically, we provide a novel generalization lower bound of $\Omega\left(\sqrt{\eta T/m^{1.5}}\right)$ for Gradient Descent, where $\eta$ is the learning rate, $T$ is the horizon and $m$ is the sample size. This narrows down, exponentially, the gap between the best known upper bound of $O(\eta T/m)$ and existing lower bounds from previous constructions.
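
The two rates quoted in the abstract can be compared numerically. The sketch below evaluates $\eta T/m$ and $\sqrt{\eta T/m^{1.5}}$ with all hidden constants set to 1, which is an assumption, as are the function names; it also ignores any restrictions on $\eta$ and $T$ under which the paper states its bounds.

```python
import math


def upper_bound_rate(eta, T, m):
    """The O(eta * T / m) upper bound, with the hidden constant dropped."""
    return eta * T / m


def lower_bound_rate(eta, T, m):
    """The Omega(sqrt(eta * T / m^1.5)) lower bound, constant dropped."""
    return math.sqrt(eta * T / m ** 1.5)


if __name__ == "__main__":
    m = 10_000
    # Two illustrative choices of the product eta * T.
    for eta_T in (m, m ** 1.5):
        up = upper_bound_rate(1.0, eta_T, m)
        low = lower_bound_rate(1.0, eta_T, m)
        print(f"eta*T = {eta_T:.0f}: upper ~ {up:.3g}, lower ~ {low:.3g}")
```

Up to constants, the lower bound becomes of order one once $\eta T$ reaches $m^{1.5}$, while the upper bound is only informative when $\eta T$ is small compared with $m$; at $\eta T = m$ the lower bound evaluates to $m^{-1/4}$.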

Paper Structure (35 sections, 7 theorems, 104 equations)

Key Result

theorem 1

Fix any $m \in \mathbb{N}$. Then there exist $\varepsilon = \Theta(m^{-3/2})$, a finite instance space $\mathcal{Z}$, a distribution $D$ over $\mathcal{Z}$, and a $1$-Lipschitz loss $f:\mathcal{W}_{d}\times\mathcal{Z}\to\mathbb{R}$ in dimension $d = 6m$ such that, with probability at least $1 \ldots$ Moreover, $f$ is $\lambda$-strongly convex with $\lambda = m^{-3/2}$.
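
As a reminder of what an $\varepsilon$-ERM is, the sketch below assumes the standard definition: a point whose empirical risk is within $\varepsilon$ of the minimum empirical risk. It uses a one-dimensional toy loss $f(w,z) = |w - z|$, which is convex and 1-Lipschitz but is not the paper's construction (in particular it is not strongly convex). The function names and the choice of the constant 1 in $\varepsilon = m^{-3/2}$ are hypothetical.

```python
def empirical_risk(w, sample):
    """Average of the toy loss f(w, z) = |w - z| over the sample."""
    return sum(abs(w - z) for z in sample) / len(sample)


def min_empirical_risk(sample):
    """Minimum empirical risk of the toy loss.

    For the absolute loss in one dimension, any median of the sample
    minimizes the empirical risk, so it suffices to evaluate one.
    """
    ordered = sorted(sample)
    median = ordered[len(ordered) // 2]
    return empirical_risk(median, sample)


def is_eps_erm(w, sample, eps):
    """True if w is an eps-approximate empirical risk minimizer."""
    return empirical_risk(w, sample) <= min_empirical_risk(sample) + eps


if __name__ == "__main__":
    sample = [-0.5, 0.0, 0.5, 1.0]
    m = len(sample)
    eps = m ** -1.5  # eps = Theta(m^{-3/2}) with the constant taken as 1
    for w in (0.25, 0.7, 1.0):
        print(w, is_eps_erm(w, sample, eps))
```

In this toy example every point between the two middle samples is an exact ERM, whereas the strong convexity in the theorem is what makes the minimizer unique.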

Theorems & Definitions (10)

  • theorem 1
  • theorem 2
  • corollary 3
  • theorem 4
  • lemma 5
  • lemma 6: boucheron2003concentration, example 6.13
  • claim 1
  • lemma 7
  • Proof
  • claim 2