Convex SGD: Generalization Without Early Stopping
Julien Hendrickx, Alex Olshevsky
TL;DR
The paper analyzes the generalization of projected SGD (PSGD) on smooth convex losses over a compact domain and shows that strong convexity is not required for good generalization. By recasting PSGD as an inexact gradient method and bounding the resulting perturbations with a concentration framework, it derives a bound that holds uniformly in $T$: for step sizes satisfying $\alpha_t \le 1/L$ and $\alpha_t = O(1/\sqrt{t})$, the generalization error obeys $\epsilon_{gen}(t,n) \le \bar{\epsilon}_{opt}(t,n) + O\big((\sigma^* R + LR^2 \sqrt{d})/\sqrt{n}\big)$, which yields $\tilde{O}(1/\sqrt{T} + 1/\sqrt{n})$ when $\alpha_t = 1/\sqrt{t}$. Unlike stability-based analyses, the proof relies on a perturbation argument and a modified Dudley inequality to control the concentration of gradient differences, so convex SGD can generalize well without early stopping. The bound carries a dimension-dependent $\sqrt{d}$ term, and the paper presents a lower-bound construction showing that this dimension dependence cannot be eliminated under the stated assumptions, highlighting an intrinsic limit of uniform generalization guarantees in high-dimensional settings.
Abstract
We consider the generalization error associated with stochastic gradient descent on a smooth convex function over a compact set. We show the first bound on the generalization error that vanishes when the number of iterations $T$ and the dataset size $n$ go to infinity at arbitrary rates; our bound scales as $\tilde{O}(1/\sqrt{T} + 1/\sqrt{n})$ with step-size $\alpha_t = 1/\sqrt{t}$. In particular, strong convexity is not needed for stochastic gradient descent to generalize well.
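
To make the setting concrete, here is a minimal Python sketch of projected SGD on a smooth convex loss over a compact set. It is an illustration under stated assumptions, not the authors' code: the squared loss, the Euclidean ball of radius `radius` as the compact domain, the synthetic data, and the step-size rule `min(1/L, 1/sqrt(t))` (one simple way to satisfy both $\alpha_t \le 1/L$ and $\alpha_t = O(1/\sqrt{t})$) are all choices made here. The printed test-minus-train gap is only an informal empirical proxy; the paper's formal definition of generalization error is not restated in this summary.

```python
import math
import random


def project_ball(w, radius):
    """Euclidean projection of w onto the ball {x : ||x|| <= radius}."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]


def psgd(data, dim, radius, smoothness, num_steps, seed=0):
    """Projected SGD on the squared loss 0.5 * (<w, x> - y)^2.

    Assumed step size: alpha_t = min(1/L, 1/sqrt(t)).
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    for t in range(1, num_steps + 1):
        x, y = data[rng.randrange(len(data))]  # sample one training point
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        grad = [residual * xi for xi in x]  # gradient of the per-sample loss
        alpha = min(1.0 / smoothness, 1.0 / math.sqrt(t))
        step = [wi - alpha * gi for wi, gi in zip(w, grad)]
        w = project_ball(step, radius)  # stay inside the compact domain
    return w


def mean_loss(w, data):
    """Average squared loss of w over a dataset."""
    total = 0.0
    for x, y in data:
        total += 0.5 * (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
    return total / len(data)


def make_data(n, w_true, noise, rng):
    """Hypothetical synthetic data: x uniform in [-1, 1]^d, y linear plus noise."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in w_true]
        y = sum(a * b for a, b in zip(w_true, x)) + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


rng = random.Random(1)
w_true = [0.5, -0.3, 0.2]
train = make_data(200, w_true, 0.1, rng)
test = make_data(2000, w_true, 0.1, rng)

# For x in [-1, 1]^d, each per-sample loss is L-smooth with L <= ||x||^2 <= d = 3.
w_hat = psgd(train, dim=3, radius=1.0, smoothness=3.0, num_steps=5000)
gap = mean_loss(w_hat, test) - mean_loss(w_hat, train)
print(f"train loss {mean_loss(w_hat, train):.4f}, test-train gap {gap:.4f}")
```

Varying `num_steps` and the training-set size in this sketch is one way to see informally how the gap behaves when the iteration is run long past the point where early stopping would normally halt it.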
