Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent

Sharan Vaswani, Benjamin Dubois-Taine, Reza Babanezhad

Abstract

We aim to make stochastic gradient descent (SGD) adaptive to (i) the noise $\sigma^2$ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly-convex functions with condition number $\kappa$, we prove that $T$ iterations of SGD with exponentially decreasing step-sizes and knowledge of the smoothness can achieve an $\tilde{O} \left(\exp \left( \frac{-T}{\kappa} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowing $\sigma^2$. To be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower bounds) that SGD with SLS converges at the desired rate, but only to a neighbourhood of the solution. On the other hand, we prove that SGD with an offline estimate of the smoothness converges to the minimizer; however, its rate is slowed down in proportion to the estimation error. Next, we prove that SGD with Nesterov acceleration and exponential step-sizes (referred to as ASGD) can achieve the near-optimal $\tilde{O} \left(\exp \left( \frac{-T}{\sqrt{\kappa}} \right) + \frac{\sigma^2}{T} \right)$ rate, without knowledge of $\sigma^2$. When used with offline estimates of the smoothness and strong-convexity, ASGD still converges to the solution, albeit at a slower rate. We empirically demonstrate the effectiveness of exponential step-sizes coupled with a novel variant of SLS.
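For readers unfamiliar with stochastic line-searches, the following is a minimal sketch of a standard backtracking Armijo search evaluated on a single sampled function $f_i$. It is an illustration under stated assumptions, not the novel SLS variant that the paper evaluates: the function name `armijo_step`, the parameters `gamma_max`, `c`, `shrink` and `max_backtracks`, and their default values are all hypothetical choices made for this example.

```python
import numpy as np


def armijo_step(f_i, grad_i, w, gamma_max=1.0, c=0.5, shrink=0.5, max_backtracks=50):
    """Backtrack from gamma_max until the Armijo condition holds on f_i.

    Assumed (standard) condition, checked on the sampled function only:
        f_i(w - gamma * g) <= f_i(w) - c * gamma * ||g||^2,  g = grad f_i(w).
    """
    g = grad_i(w)
    g_sq = float(g @ g)      # squared norm of the sampled gradient
    f_w = f_i(w)             # sampled loss at the current iterate
    gamma = gamma_max
    for _ in range(max_backtracks):
        if f_i(w - gamma * g) <= f_w - c * gamma * g_sq:
            break            # sufficient decrease reached
        gamma *= shrink      # otherwise shrink the candidate step-size
    return gamma
```

On a quadratic $f_i(w) = \frac{L_i}{2}\|w\|^2$ with `c = 0.5`, the condition holds exactly when the step-size is at most $1/L_i$, so the search returns a step-size no larger than $1/L_i$ without being told $L_i$. This is the sense in which a line-search can stand in for knowledge of the smoothness.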

Paper Structure

This paper contains 37 sections, 36 theorems, 182 equations, 1 figure.

Key Result

Theorem 1

Assuming (i) convexity and $L_i$-smoothness of each $f_i$ and (ii) $\mu$-strong convexity of $f$, SGD with $\gamma_{k} = \frac{1}{L}$ and $\alpha_{k} = \left(\frac{\beta}{T}\right)^{k/T}$ converges at the $\tilde{O} \left(\exp \left( \frac{-T}{\kappa} \right) + \frac{\sigma^2}{T} \right)$ rate stated in the abstract. The bound involves the constant $c_2 = \exp\left( \frac{1}{\kappa} \cdot \frac{2\beta}{\ln(T/\beta)}\right)$.
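The step-size schedule in Theorem 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that the effective step-size at iteration $k$ is the product $\gamma_k \alpha_k$, i.e. the update $w_{k+1} = w_k - \gamma_k \alpha_k \nabla f_{i_k}(w_k)$ (the update rule itself is not reproduced in this summary), that $L = \max_i L_i$, and that $\beta = 1$. It uses a least-squares objective $f_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$ as an example problem. The function names `exponential_step_sizes` and `sgd_exponential` are hypothetical.

```python
import numpy as np


def exponential_step_sizes(T, L, beta=1.0):
    """Return gamma_k * alpha_k for k = 1..T, with gamma_k = 1/L and
    alpha_k = (beta / T)^(k / T). Requires T > beta > 0."""
    k = np.arange(1, T + 1)
    alpha_k = (beta / T) ** (k / T)
    return alpha_k / L


def sgd_exponential(X, y, T, beta=1.0, seed=0):
    """SGD on f_i(w) = 0.5 * (x_i^T w - y_i)^2 with exponential step-sizes.

    Assumption: L = max_i L_i, where L_i = ||x_i||^2 for this loss.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = np.max(np.sum(X ** 2, axis=1))
    steps = exponential_step_sizes(T, L, beta)
    w = np.zeros(d)
    for eta in steps:
        i = rng.integers(n)                   # sample one data point uniformly
        grad_i = (X[i] @ w - y[i]) * X[i]     # stochastic gradient of f_i at w
        w = w - eta * grad_i
    return w
```

Note that the schedule needs only $L$ and the horizon $T$: the step-size starts near $1/L$ and decays geometrically to $\beta/(TL)$ at the last iteration, with no dependence on $\sigma^2$.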

Figures (1)

  • Figure 1: Comparison for (a) squared loss and (b) logistic loss. Observe that (i) exponentially decreasing step-sizes result in more stable performance than a constant step-size (for both SGD and ASGD), (ii) they consistently outperform the noise-adaptive methods in KR-20 and M-ASG, and (iii) methods using the SLS in \ref{eq:armijo-ls-conservative} match the performance of those with known smoothness.
