Convex SGD: Generalization Without Early Stopping

Julien Hendrickx; Alex Olshevsky

TL;DR

The paper analyzes generalization for projected SGD (PSGD) on smooth convex losses over a compact domain, showing that strong convexity is not required for good generalization. By recasting PSGD as an inexact gradient method and bounding the resulting perturbations via a concentration framework, it derives a bound that holds uniformly in $T$: for step sizes $\alpha_t \le 1/L$ and $\alpha_t = O(1/\sqrt{t})$, the generalization error obeys $\epsilon_{gen}(t,n) \le \bar{\epsilon}_{opt}(t,n) + O\big((\sigma^* R + LR^2 \sqrt{d})/\sqrt{n}\big)$, which yields $\tilde{O}(1/\sqrt{T} + 1/\sqrt{n})$ when $\alpha_t = 1/\sqrt{t}$. Unlike stability-based analyses, the proof relies on a perturbation argument and a modified Dudley inequality to bound the concentration of gradient differences, so no early stopping is needed. The bound carries a dimension-dependent factor $\sqrt{d}$, and the paper presents a lower-bound construction showing that this dimension dependence cannot be eliminated under the stated assumptions, an intrinsic limit of uniform generalization guarantees in high-dimensional settings.
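To make the setting concrete, here is a minimal sketch of projected SGD with the step-size rule $\alpha_t = \min(1/L, 1/\sqrt{t})$, which satisfies both conditions quoted above. Everything specific in it is an assumption chosen for illustration and not taken from the paper: the least-squares loss, the Euclidean ball of radius `radius` as the compact domain, sampling one data point uniformly per step, and the function names `project_onto_ball`, `projected_sgd`, and `mean_loss`.

```python
import numpy as np


def project_onto_ball(x, radius):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def projected_sgd(A, b, radius, num_steps, seed=0):
    """Projected SGD on per-sample losses f_i(x) = 0.5 * (a_i @ x - b_i)**2.

    Illustrative assumptions: least-squares loss, ball-shaped domain,
    one uniformly sampled data point per step.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Each f_i has Hessian a_i a_i^T, so max_i ||a_i||^2 is a valid
    # smoothness constant L for every per-sample loss.
    L = np.max(np.sum(A**2, axis=1))
    x = np.zeros(d)
    for t in range(1, num_steps + 1):
        i = rng.integers(n)                      # sample one data point
        grad = (A[i] @ x - b[i]) * A[i]          # stochastic gradient
        alpha = min(1.0 / L, 1.0 / np.sqrt(t))   # alpha_t <= 1/L and O(1/sqrt(t))
        x = project_onto_ball(x - alpha * grad, radius)
    return x


def mean_loss(A, b, x):
    """Average least-squares loss of x on the data set (A, b)."""
    return 0.5 * np.mean((A @ x - b) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, n = 5, 200
    x_true = np.ones(d) / np.sqrt(d)             # lies on the unit ball
    A_train = rng.normal(size=(n, d))
    b_train = A_train @ x_true + 0.1 * rng.normal(size=n)
    A_test = rng.normal(size=(10 * n, d))
    b_test = A_test @ x_true + 0.1 * rng.normal(size=10 * n)

    x_hat = projected_sgd(A_train, b_train, radius=1.0, num_steps=2000)
    train = mean_loss(A_train, b_train, x_hat)
    test = mean_loss(A_test, b_test, x_hat)
    print("train loss:", train)
    print("held-out loss:", test)
    print("empirical gap:", test - train)
```

The held-out loss here is only an empirical stand-in for the population risk; the script illustrates the quantities the bound talks about and does not verify the bound itself.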

Abstract

We consider the generalization error associated with stochastic gradient descent on a smooth convex function over a compact set. We show the first bound on the generalization error that vanishes when the number of iterations $T$ and the dataset size $n$ go to infinity at arbitrary rates; our bound scales as $\tilde{O}(1/\sqrt{T} + 1/\sqrt{n})$ with step-size $\alpha_t = 1/\sqrt{t}$. In particular, strong convexity is not needed for stochastic gradient descent to generalize well.

Paper Structure (12 sections, 11 theorems, 118 equations, 1 figure, 1 table)

Introduction
Literature Review
Our Contribution
Our Main Result
Problem Settings and Statement
Proof of our main result
Concentration of gradient differences
Modified Dudley's Inequality
Application to Gradient Concentration
Inevitability of Dimension Dependence
Conclusion
Acknowledgements

Key Result

Theorem 2

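Under the paper's assumption on the loss function (smooth and convex over a compact domain), and provided the step sizes are bounded as $\alpha_t \le \frac{1}{L}$, we have, in the form given in the TL;DR, $\epsilon_{gen}(t,n) \le \bar{\epsilon}_{opt}(t,n) + O\big((\sigma^* R + LR^2 \sqrt{d})/\sqrt{n}\big)$.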

Figures (1)

Figure 1: Loss function used in our counterexample.

Theorems & Definitions (11)

Theorem 2
Lemma 3
Lemma 4
Proposition 5
Lemma 6
Theorem 8
Lemma 9
Lemma 10
Lemma 11
Lemma 12
...and 1 more
