Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size

Kanata Oowada, Hideaki Iiduka

TL;DR

Using an increasing batch size leads to faster convergence than using a constant batch size, both with a constant learning rate and with decaying learning rates such as cosine annealing and polynomial decay.

Abstract

We theoretically analyzed the convergence behavior of Riemannian stochastic gradient descent (RSGD) and found that using an increasing batch size leads to faster convergence than using a constant batch size, not only with a constant learning rate but also with a decaying learning rate, such as cosine annealing decay and polynomial decay. The convergence rate improves from $O(T^{-1}+C)$ with a constant batch size to $O(T^{-1})$ with an increasing batch size, where $T$ denotes the total number of iterations and $C$ is a constant. Using principal component analysis and low-rank matrix completion, we investigated, both theoretically and numerically, how an increasing batch size affects computational time as quantified by stochastic first-order oracle (SFO) complexity. An increasing batch size was found to reduce the SFO complexity of RSGD. Furthermore, an increasing batch size was found to offer the advantages of both small and large constant batch sizes.
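To make the setting concrete, the following is a minimal sketch of RSGD with an exponentially growing batch size and a cosine annealing learning rate, applied to a PCA-style problem on the unit sphere. It is an illustration under stated assumptions, not the authors' implementation: the function name `rsgd_pca_sphere`, its parameters (`b0`, `growth`, `n_increases`, `lr_max`, `lr_min`), the choice of the sphere with the normalization retraction, and the specific batch-size schedule are all hypothetical choices made here. The SFO count is taken to be the total number of stochastic gradients evaluated, i.e. the sum of the batch sizes over all iterations.

```python
import math
import numpy as np


def rsgd_pca_sphere(A, steps=300, b0=8, growth=2, n_increases=3,
                    lr_max=0.05, lr_min=0.0, seed=0):
    """Sketch of RSGD for the leading principal direction on the unit sphere.

    Minimizes f(x) = -(1/(2N)) * sum_i (a_i^T x)^2 subject to ||x|| = 1.
    Returns the final iterate and the SFO count (sum of batch sizes used).
    All names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)  # start on the sphere

    # Number of iterations between consecutive batch-size increases.
    period = math.ceil(steps / (n_increases + 1))
    sfo = 0
    for t in range(steps):
        # Exponential-growth batch size (assumed schedule), capped at N.
        b = min(n, b0 * growth ** (t // period))
        # Cosine annealing learning rate from lr_max down toward lr_min.
        lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / steps))

        batch = A[rng.choice(n, size=b, replace=False)]
        egrad = -(batch.T @ (batch @ x)) / b   # Euclidean stochastic gradient
        rgrad = egrad - (x @ egrad) * x        # project onto the tangent space at x
        x = x - lr * rgrad                     # step along the negative Riemannian gradient
        x /= np.linalg.norm(x)                 # retraction: renormalize onto the sphere
        sfo += b                               # one stochastic gradient per sample
    return x, sfo
```

Setting `n_increases=0` recovers a constant batch size `b0`, so the same function can be used to compare the two regimes at an equal SFO budget. The learning rate has to be small relative to the spectrum of the data covariance for the iteration to be stable.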

Paper Structure

This paper contains 31 sections, 10 theorems, 110 equations, 28 figures, 3 tables.

Key Result

lemma 1

Let $(x_t)_t$ be a sequence generated by RSGD and let $\eta_{\max}>0$. Consider a positive-valued sequence $(\eta_t)_t$ such that $\eta_t \in [0,\eta_{\max}] \subset [0, \frac{2}{L_r})$. Then, under Assumptions asm:rtr_smooth and asm:sto_gra, we obtain

Figures (28)

  • Figure 1: Norm of the gradient of the objective function versus the number of iterations for the constant, diminishing, cosine annealing, and polynomial decay LRs on the COIL100 dataset (PCA).
  • Figure 2: Norm of the gradient of the objective function versus the number of iterations for the constant, diminishing, cosine annealing, and polynomial decay LRs on the MNIST dataset (PCA).
  • Figure 3: Norm of the gradient of the objective function versus the number of iterations for the constant, diminishing, cosine annealing, and polynomial decay LRs on the MovieLens-1M dataset (LRMC).
  • Figure 4: Norm of the gradient of the objective function versus the number of iterations for the constant, diminishing, cosine annealing, and polynomial decay LRs on the Jester dataset (LRMC).
  • Figure 5: Norm of the gradient of the objective function versus SFO complexity. The datasets, from left to right, are COIL100 (PCA), MNIST (PCA), MovieLens-1M (LRMC), and Jester (LRMC). A cosine annealing LR was used for all datasets except COIL100, for which a constant LR was used. For 'BS_increases $=0$', a constant BS $b=b_0$ was used. For 'BS_increases $=3$' and 'BS_increases $=6$', the BS was increased $3$ and $6$ times, respectively, following the exponential growth BS schedule.
  • ...and 23 more figures

Theorems & Definitions (19)

  • definition 1: Retraction
  • lemma 1: Underlying Analysis
  • theorem 3
  • theorem 4
  • theorem 5
  • theorem 6
  • theorem 7
  • remark 1
  • proposition 1
  • proof
  • ...and 9 more