Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Alexander Rakhlin, Ohad Shamir, Karthik Sridharan
TL;DR
The paper analyzes the convergence of stochastic gradient descent (SGD) for strongly convex stochastic optimization. When the objective is smooth, SGD (with or without averaging) attains the optimal $O(1/T)$ rate; for non-smooth objectives, standard SGD with averaging can incur an $\Omega(\log(T)/T)$ rate. A simple modification, $\alpha$-suffix averaging (averaging only the last $\alpha T$ iterates), restores the optimal $O(1/T)$ rate for any strongly convex problem with no other change to the algorithm. The authors provide theoretical guarantees, including high-probability bounds, and experiments on synthetic and real data that illustrate when averaging helps or hurts. The results clarify SGD’s practical performance and offer a lightweight technique for achieving optimal rates in broader settings.
Abstract
Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be $O(\log(T)/T)$, by running SGD for $T$ iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal $O(1/T)$ rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal $O(1/T)$ rate. However, for non-smooth problems, the convergence rate with averaging might really be $\Omega(\log(T)/T)$, and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the $O(1/T)$ rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.
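
To make the modified averaging step concrete, here is a minimal Python sketch that is illustrative only and is not the authors' code. It assumes the step size $1/(\lambda t)$ commonly used for $\lambda$-strongly convex problems, a one-dimensional domain $[-r, r]$, and a toy non-smooth objective $F(w) = w^2/2 + |w|$ with Gaussian gradient noise. The function names (`sgd_iterates`, `suffix_average`, `noisy_subgradient`) and all parameter values are hypothetical choices made for this example.

```python
import math
import random


def suffix_average(iterates, alpha=0.5):
    """Average the last ceil(alpha * T) of the T iterates.

    alpha = 1.0 is the standard average over all iterates;
    smaller alpha keeps only a suffix of the run.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    T = len(iterates)
    k = max(1, math.ceil(alpha * T))
    tail = iterates[T - k:]
    return sum(tail) / k


def sgd_iterates(grad_oracle, w0, lam, T, radius=1.0):
    """Projected SGD on [-radius, radius] with step size 1 / (lam * t).

    Returns the list of iterates w_1, ..., w_T (w_1 = w0).
    The step-size schedule is an assumption of this sketch.
    """
    w = w0
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w)
        g = grad_oracle(w)
        w = w - g / (lam * t)
        w = max(-radius, min(radius, w))  # project back onto the interval
    return iterates


def noisy_subgradient(w, rng, noise_std=1.0):
    """Noisy subgradient of the toy objective F(w) = w^2 / 2 + |w|."""
    sign = (w > 0) - (w < 0)  # 0 is a valid subgradient of |w| at w = 0
    return w + sign + rng.gauss(0.0, noise_std)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    ws = sgd_iterates(lambda w: noisy_subgradient(w, rng), w0=1.0, lam=1.0, T=10_000)
    print("full average:  ", suffix_average(ws, alpha=1.0))
    print("suffix average:", suffix_average(ws, alpha=0.5))
    print("last iterate:  ", ws[-1])
```

The only difference between the two averaging schemes is the slice taken in `suffix_average`; the SGD loop itself is unchanged, which matches the abstract's statement that no other change of the algorithm is necessary. A single run on a toy problem like this one does not demonstrate the rates themselves.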
