Improved Learning Rates for Stochastic Optimization

Shaojie Li, Pengwei Tang, Yong Liu

Abstract

Stochastic optimization is a cornerstone of modern machine learning. This paper studies the generalization performance of two classical stochastic optimization algorithms: stochastic gradient descent (SGD) and Nesterov's accelerated gradient (NAG). We establish new learning rates for both algorithms, with improved guarantees in some settings or comparable rates under weaker assumptions in others. We also provide numerical experiments to support the theory.

Paper Structure

This paper contains 31 sections, 20 theorems, 263 equations, 5 figures.

Key Result

Theorem 1

Suppose Assumptions assu4, assu7, assu8 and assu5 hold, and suppose the population risk $F$ satisfies Assumption assu10 with parameter $\mu$. Let $\{ \mathbf{w}_t\}_t$ be the sequence produced by (eq1) with $\eta_t = \eta_1 t^{- 1/2}$ such that $\eta_1 \leq \frac{1}{2\beta}$. When $n \geq \frac{c\be if further assuming $F^{\ast} = \mathcal{O}(1/n)$, we have where $c$ is an absolute constant.
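To make the step-size schedule in Theorem 1 concrete, here is a minimal sketch of SGD with $\eta_t = \eta_1 t^{-1/2}$ and $\eta_1 = \frac{1}{2\beta}$. Several things are assumptions, since the source does not show them: that (eq1) is the usual update $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t; z_{i_t})$ with $i_t$ drawn uniformly at random, that $\beta$ is the smoothness constant of the per-sample loss, and that a scalar least-squares loss is an acceptable stand-in. The names `step_size` and `sgd_least_squares` and the toy data are hypothetical and are not taken from the paper.

```python
import random


def step_size(t, eta_1):
    """Schedule from Theorem 1: eta_t = eta_1 * t^(-1/2), for t = 1, 2, ..."""
    return eta_1 / (t ** 0.5)


def sgd_least_squares(xs, ys, beta, num_steps, seed=0):
    """Plain SGD on the per-sample loss f(w; (x, y)) = 0.5 * (w * x - y)^2, scalar w.

    Assumption: the update is w_{t+1} = w_t - eta_t * grad f(w_t; z_{i_t}),
    with i_t sampled uniformly from the n training points.
    """
    rng = random.Random(seed)
    eta_1 = 1.0 / (2.0 * beta)  # largest value allowed by eta_1 <= 1 / (2 * beta)
    w = 0.0
    for t in range(1, num_steps + 1):
        i = rng.randrange(len(xs))          # pick one sample at random
        grad = (w * xs[i] - ys[i]) * xs[i]  # gradient of the per-sample loss
        w -= step_size(t, eta_1) * grad
    return w


# Toy data (hypothetical): y = 2 * x exactly, so the minimizer is w = 2 and F* = 0.
xs = [0.5, 1.0, -1.0, 0.25, -0.75]
ys = [2.0 * x for x in xs]
beta = max(x * x for x in xs)  # smoothness constant of the worst per-sample loss
w_hat = sgd_least_squares(xs, ys, beta, num_steps=5000)
```

The toy problem is noiseless, so it matches the low-noise flavor of the condition $F^{\ast} = \mathcal{O}(1/n)$ only loosely. It illustrates the schedule, not the theorem's rate.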

Figures (5)

  • Figure 1: The excess risk $F(\mathbf{w}) - F^{\ast}$ versus the number of iterations for the logistic link function across different datasets: Breast-Cancer, German, Heart, and IJCNN.
  • Figure 2: The excess risk $F(\mathbf{w}) - F^{\ast}$ versus the number of iterations for the probit link function across different datasets: Breast-Cancer, German, Heart, and IJCNN.
  • Figure 3: The excess risk $F(\mathbf{w}) - F^{\ast}$ versus the number of samples for the probit link function (left) and the logistic link function (right) on the IJCNN dataset.
  • Figure 4: The excess risk $F(\mathbf{w})-F^\ast$ versus the number of iterations (left) and the number of samples (right) on the MNIST dataset for image classification.
  • Figure 5: The excess risk $F(\mathbf{w})-F^\ast$ versus the number of iterations (left) and the number of samples (right) on the SMS Spam Collection dataset for spam detection.

Theorems & Definitions (43)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Theorem 1
  • Theorem 2
  • Remark 7
  • Theorem 3
  • ...and 33 more