Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Alexander Rakhlin, Ohad Shamir, Karthik Sridharan

TL;DR

The paper analyzes the convergence of stochastic gradient descent (SGD) for strongly convex stochastic optimization. When the objective is smooth, SGD (with or without averaging) attains the optimal $O(1/T)$ rate, whereas for non-smooth objectives SGD with standard averaging can incur an $\Omega(\log(T)/T)$ rate. Importantly, a simple modification of the averaging step, $\alpha$-suffix averaging, restores the optimal $O(1/T)$ rate for any strongly convex problem without any other change to the algorithm. The authors provide both theoretical guarantees and experimental evidence, including high-probability bounds and empirical results on synthetic and real data, illustrating when averaging helps or hurts. The results clarify SGD’s practical performance and offer a lightweight technique to achieve optimal rates in broader settings.

Abstract

Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be $O(\log(T)/T)$, by running SGD for $T$ iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal $O(1/T)$ rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal $O(1/T)$ rate. However, for non-smooth problems, the convergence rate with averaging might really be $\Omega(\log(T)/T)$, and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the $O(1/T)$ rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.
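
The modified averaging step can be illustrated with a minimal Python sketch. It assumes that $\alpha$-suffix averaging means returning the mean of the last $\alpha T$ iterates instead of the mean of all $T$ iterates, and that SGD uses the step size $\eta_t = 1/(\lambda t)$ with projection onto the domain. The function name `sgd_with_suffix_averaging`, the toy objective $F(w) = \tfrac{1}{2}w^2 + |w|$, and the `clip` projection are hypothetical choices made for illustration, not taken from the paper.

```python
import random


def sgd_with_suffix_averaging(grad_oracle, w0, lam, T, alpha=0.5, project=None):
    """Run T rounds of SGD with step size 1/(lam * t).

    Returns (last_iterate, full_average, suffix_average), where suffix_average
    is the mean of the last int(alpha * T) iterates (at least one iterate).
    """
    if T < 1:
        raise ValueError("T must be at least 1")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")

    w = list(w0)
    dim = len(w)
    suffix_len = max(1, int(alpha * T))
    suffix_start = T - suffix_len + 1  # 1-based index of the first averaged iterate
    full_sum = [0.0] * dim
    suffix_sum = [0.0] * dim

    for t in range(1, T + 1):
        # Accumulate the current iterate w_t before updating it.
        for i in range(dim):
            full_sum[i] += w[i]
            if t >= suffix_start:
                suffix_sum[i] += w[i]
        if t == T:
            break
        g = grad_oracle(w)        # stochastic (sub)gradient at w_t
        eta = 1.0 / (lam * t)     # step size eta_t = 1 / (lambda * t)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        if project is not None:
            w = project(w)        # projection back onto the domain W

    full_avg = [s / T for s in full_sum]
    suffix_avg = [s / suffix_len for s in suffix_sum]
    return w, full_avg, suffix_avg


if __name__ == "__main__":
    rng = random.Random(0)

    def noisy_subgradient(w):
        # Subgradient of the toy objective F(w) = 0.5*w^2 + |w| (assumed
        # example, 1-strongly convex, minimized at 0) plus zero-mean noise.
        x = w[0]
        sign = (x > 0) - (x < 0)
        return [x + sign + rng.gauss(0.0, 1.0)]

    def clip(w):
        # Projection onto the assumed domain W = [-1, 1].
        return [max(-1.0, min(1.0, x)) for x in w]

    last, full, suffix = sgd_with_suffix_averaging(
        noisy_subgradient, [1.0], lam=1.0, T=10000, alpha=0.5, project=clip
    )
    print("last iterate:", last)
    print("full average:", full)
    print("suffix average:", suffix)
```

Setting `alpha=1.0` recovers the standard average of all iterates, so the only difference from plain averaged SGD is which iterates enter the final mean; the update rule itself is untouched.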

Paper Structure

This paper contains 17 sections, 13 theorems, 85 equations, 5 figures.

Key Result

Theorem 1

Suppose $F$ is $\lambda$-strongly convex and $\mu$-smooth with respect to $\mathbf{w}^*$ over a convex set $\mathcal{W}$, and that $\mathbb{E}[\|\hat{\mathbf{g}}_t\|^2]\leq G^2$. Then if we pick $\eta_t = 1/(\lambda t)$, it holds for any $T$ that

Figures (5)

  • Figure 1: Results for smooth strongly convex stochastic optimization problem. The experiment was repeated $10$ times, and we report the mean and standard deviation for each choice of $T$. The X-axis is the log-number of rounds $\log(T)$, and the Y-axis is $(F(\mathbf{w}_T)-F(\mathbf{w}^*))*T$. The scaling by $T$ means that a roughly constant graph corresponds to a $\Theta(1/T)$ rate, whereas a linearly increasing graph corresponds to a $\Theta(\log(T)/T)$ rate.
  • Figure 2: Results for the non-smooth strongly convex stochastic optimization problem. The experiment was repeated $10$ times, and we report the mean and standard deviation for each choice of $T$. The X-axis is the log-number of rounds $\log(T)$, and the Y-axis is $(F(\mathbf{w}_T)-F(\mathbf{w}^*))*T$. The scaling by $T$ means that a roughly constant graph corresponds to a $\Theta(1/T)$ rate, whereas a linearly increasing graph corresponds to a $\Theta(\log(T)/T)$ rate.
  • Figure 3: Results for the astro-ph dataset. The left row refers to the average loss on the training data, and the right row refers to the average loss on the test data. Each experiment was repeated $10$ times, and we report the mean and standard deviation for each choice of $T$. The X-axis is the log-number of rounds $\log(T)$, and the Y-axis is the log of the objective function $\log(F(\mathbf{w}_T))$.
  • Figure 4: Results for the ccat dataset. See the Figure 3 caption for details.
  • Figure 5: Results for the ccat dataset. See the Figure 3 caption for details.
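
The scaling used on the Y-axis of Figures 1 and 2 can be made explicit with a short worked calculation. Here $c > 0$ is a hypothetical constant introduced only for illustration, and the rates are taken to hold exactly rather than up to constants.

$$
F(\mathbf{w}_T)-F(\mathbf{w}^*) = \frac{c}{T}
\;\Longrightarrow\;
\bigl(F(\mathbf{w}_T)-F(\mathbf{w}^*)\bigr)\cdot T = c,
$$

$$
F(\mathbf{w}_T)-F(\mathbf{w}^*) = \frac{c\,\log(T)}{T}
\;\Longrightarrow\;
\bigl(F(\mathbf{w}_T)-F(\mathbf{w}^*)\bigr)\cdot T = c\,\log(T).
$$

Because the X-axis is $\log(T)$, the first case plots as a horizontal line at height $c$ and the second as a straight line through the origin with slope $c$, which is why a flat curve indicates a $\Theta(1/T)$ rate and a linearly increasing curve indicates a $\Theta(\log(T)/T)$ rate.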

Theorems & Definitions (20)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • proof : Proof Sketch
  • Proposition 1
  • Lemma 2
  • proof
  • ...and 10 more