All ERMs Can Fail in Stochastic Convex Optimization: Lower Bounds in Linear Dimension

Tal Burla; Roi Livni

TL;DR

This work studies generalization in overparameterized stochastic convex optimization (SCO). It constructs an instance, in dimension linear in the sample size, where learning is possible yet the best-case ERM is likely to be unique and to overfit, resolving an open question of Feldman. The analysis extends to approximate ERMs: every $\Theta(m^{-3/2})$-ERM incurs constant population excess risk. Building on the same construction, the paper derives a generalization lower bound of $\Omega\left(\sqrt{\eta T/m^{1.5}}\right)$ for constrained Gradient Descent (GD), where $\eta$ is the learning rate, $T$ is the horizon, and $m$ is the sample size, showing that GD can overfit when the horizon and learning rate grow with the sample size. The paper also relates ERM behavior to GD dynamics across regimes, highlighting the limits of ERMs and implicit biases in explaining learnability in SCO. Open questions remain about extensions to smooth objectives and intermediate-accuracy ERMs.

Abstract

We study the sample complexity of the best-case Empirical Risk Minimizer (ERM) in the setting of stochastic convex optimization. We show that there exists an instance in which the sample size is linear in the dimension and learning is possible, but the Empirical Risk Minimizer is likely to be unique and to overfit. This resolves an open question by Feldman. We also extend this result to approximate ERMs. Building on our construction, we also show that (constrained) Gradient Descent potentially overfits when the horizon and learning rate grow with respect to the sample size. Specifically, we provide a novel generalization lower bound of $\Omega\left(\sqrt{\eta T/m^{1.5}}\right)$ for Gradient Descent, where $\eta$ is the learning rate, $T$ is the horizon, and $m$ is the sample size. This narrows down, exponentially, the gap between the best known upper bound of $O(\eta T/m)$ and existing lower bounds from previous constructions.
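To make the quantities in the abstract concrete, the following is a minimal sketch of constrained (projected) Gradient Descent together with the two rates quoted above. It is an illustration under stated assumptions, not the paper's construction: the function names, the unit-ball domain, the averaged output iterate, and the toy loss $|\langle w, z\rangle|$ are all hypothetical choices, and `bound_rates` drops every constant hidden in the $O(\cdot)$ and $\Omega(\cdot)$ notation.

```python
import numpy as np


def project_to_unit_ball(w):
    """Euclidean projection onto the unit ball (assumed domain)."""
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm


def projected_gd(subgrad, w0, eta, T):
    """Run T projected subgradient steps with learning rate eta.

    Returns the average of the T visited iterates. Averaging is an
    assumption made for this sketch; other output rules are possible.
    """
    w = project_to_unit_ball(np.array(w0, dtype=float))
    running_sum = np.zeros_like(w)
    for _ in range(T):
        running_sum += w
        w = project_to_unit_ball(w - eta * subgrad(w))
    return running_sum / T


def bound_rates(eta, T, m):
    """Rates from the abstract with all constants dropped.

    upper: eta * T / m          (best known upper bound, up to constants)
    lower: sqrt(eta * T / m^1.5) (lower bound from this work, up to constants)
    """
    upper = eta * T / m
    lower = np.sqrt(eta * T / m ** 1.5)
    return upper, lower


# Toy usage: a 1-Lipschitz empirical risk (1/m) * sum_i |<w, z_i>| with ||z_i|| <= 1.
rng = np.random.default_rng(0)
m, d = 16, 8
Z = rng.normal(size=(m, d))
Z /= np.maximum(1.0, np.linalg.norm(Z, axis=1, keepdims=True))


def empirical_subgrad(w):
    # A subgradient of the toy empirical risk at w.
    return (np.sign(Z @ w)[:, None] * Z).mean(axis=0)


w_hat = projected_gd(empirical_subgrad, rng.normal(size=d), eta=0.1, T=100)
upper, lower = bound_rates(eta=0.1, T=100, m=m)
```

Under these conventions, choosing $\eta T$ of order $m^{1.5}$ makes the expression inside the $\Omega(\cdot)$ of order one, which is the regime in which the abstract says Gradient Descent can overfit.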

Paper Structure (35 sections, 7 theorems, 104 equations)

Introduction
Insights on Gradient Descent
Related Work
The Sample Complexity of ERMs
Generalization Bounds for Gradient Descent
Preliminaries and Setup
Learning.
Empirical Risk Minimizers.
Non-ERM learners
Gradient Descent.
Main Results
Generalization Lower Bounds for Empirical Risk Minimizers
Applications and Extensions to Gradient Descent
Discussion
Approximate ERMs
...and 20 more sections

Key Result

theorem 1

Fix any $m \in \mathbb{N}$. Then there exist $\varepsilon = \Theta(m^{-3/2})$, a finite instance space $\mathcal{Z}$, a distribution $D$ over $\mathcal{Z}$, and a $1$-Lipschitz loss $f:\mathcal{W}_{d}\times\mathcal{Z}\to\mathbb{R}$ in dimension $d = 6m$ such that, with probability at least $1 \ldots$ [truncated]. Moreover, $f$ is $\lambda$-strongly convex with $\lambda = m^{-3/2}$.
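For readers less familiar with the terminology, the objects in this theorem can be written out as follows. This is a sketch using the standard SCO definitions. The symbols $F_S$, $F$, and $w_S$ are assumed notation and may differ from the paper's own, and $\mathcal{W}_d$ is assumed to be a convex, compact domain.

```latex
% Empirical and population risk for a sample S = (z_1, ..., z_m) drawn i.i.d. from D
\[
F_S(w) = \frac{1}{m}\sum_{i=1}^{m} f(w, z_i),
\qquad
F(w) = \mathbb{E}_{z \sim D}\bigl[f(w, z)\bigr].
\]
% w is an epsilon-ERM if its empirical risk is within epsilon of the minimum
\[
F_S(w) \le \min_{w' \in \mathcal{W}_d} F_S(w') + \varepsilon .
\]
% Population excess risk of w
\[
F(w) - \min_{w' \in \mathcal{W}_d} F(w').
\]
% If f(., z) is lambda-strongly convex for every z, then so is F_S, and for its minimizer w_S
\[
F_S(w) - F_S(w_S) \ge \frac{\lambda}{2}\,\lVert w - w_S \rVert^{2}
\quad \text{for all } w \in \mathcal{W}_d .
\]
```

The last inequality is the standard reason a strongly convex empirical risk has a single exact minimizer: any $w \ne w_S$ has strictly larger empirical risk. This matches the abstract's statement that the ERM is likely to be unique.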

Theorems & Definitions (10)

theorem 1
theorem 2
corollary 3
theorem 4
lemma 5
lemma 6: boucheron2003concentration, example 6.13
claim 1
lemma 7
Proof
claim 2
