Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent

Hikaru Umeda, Hideaki Iiduka

TL;DR

The paper studies how jointly scheduling the batch size and the learning rate (LR) affects mini-batch stochastic gradient descent (SGD) for nonconvex empirical risk minimization (ERM), and derives a unified upper bound on the minimum expected squared full-gradient norm over the iterations. It analyzes four practical schedulers: (i) constant batch size with decaying LR, (ii) increasing batch size with decaying LR, (iii) increasing batch size with increasing LR, and (iv) increasing batch size with warm-up decaying LR. It proves that increasing both the batch size and the LR can accelerate convergence, with rates tied to $B_T$ and $V_T$. Numerical experiments on CNNs complement the theory, showing that increasing the batch size together with a warm-up LR yields the fastest convergence, which broadly validates the proposed approach. The work has practical implications for training deep networks efficiently and lays the groundwork for broader empirical validation across architectures and datasets.

Abstract

The performance of mini-batch stochastic gradient descent (SGD) strongly depends on how the batch size and learning rate are set to minimize the empirical loss when training a deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate scheduler, (ii) increasing batch size and decaying learning rate scheduler, (iii) increasing batch size and increasing learning rate scheduler, and (iv) increasing batch size and warm-up decaying learning rate scheduler. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results that support the analyses, showing that using scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than using scheduler (i) or (ii).
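
As a rough illustration of the four schedulers (i)-(iv), the sketch below writes each one as a pair of functions of the epoch index. Every function name and numeric value here (initial batch sizes, the growth factor `delta`, the multiplier `gamma`, the 30-epoch interval, the cosine decay, and the linear warm-up shape) is an assumption chosen for illustration and is not taken from the paper's definitions.

```python
import math

def constant_batch(epoch, b0=128):
    """Scheduler (i): the batch size never changes (b0 is a hypothetical value)."""
    return b0

def increasing_batch(epoch, b0=8, delta=2, every=30):
    """Schedulers (ii)-(iv): multiply the batch size by `delta` every `every` epochs."""
    return b0 * delta ** (epoch // every)

def cosine_decay_lr(epoch, total=120, eta_max=0.1, eta_min=0.001):
    """A decaying learning rate; cosine annealing is only one possible decay."""
    cos = math.cos(math.pi * min(epoch, total) / total)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cos)

def increasing_lr(epoch, eta0=0.1, gamma=1.2, every=30):
    """An increasing learning rate: multiply by `gamma` every `every` epochs."""
    return eta0 * gamma ** (epoch // every)

def warmup_decay_lr(epoch, warmup=30, total=120, eta0=0.01, eta_max=0.1, eta_min=0.001):
    """Linear warm-up to `eta_max` (an assumed shape), then cosine decay to `eta_min`."""
    if epoch < warmup:
        return eta0 + (eta_max - eta0) * epoch / warmup
    return cosine_decay_lr(epoch - warmup, total - warmup, eta_max, eta_min)

# Each scheduler is a (batch-size rule, learning-rate rule) pair.
SCHEDULERS = {
    "i": (constant_batch, cosine_decay_lr),
    "ii": (increasing_batch, cosine_decay_lr),
    "iii": (increasing_batch, increasing_lr),
    "iv": (increasing_batch, warmup_decay_lr),
}

def schedule(name, epoch):
    """Return (batch size, learning rate) for scheduler `name` at `epoch`."""
    batch_fn, lr_fn = SCHEDULERS[name]
    return batch_fn(epoch), lr_fn(epoch)

if __name__ == "__main__":
    for name in SCHEDULERS:
        print(name, [schedule(name, e) for e in (0, 30, 60, 90)])
```

Keeping the batch-size rule and the learning-rate rule as separate functions makes it easy to swap one while holding the other fixed, which is how the four schedulers differ from one another.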

Paper Structure

This paper contains 24 sections, 9 theorems, 125 equations, 21 figures, 2 tables, 1 algorithm.

Key Result

Lemma 2.1

Suppose that Assumption [assum:1] holds and consider the sequence $(\bm{\theta}_t)$ generated by Algorithm 1 with $\eta_t \in [\eta_{\min}, \eta_{\max}] \subset [0, \frac{2}{\bar{L}})$ satisfying $\sum_{t=0}^{T-1} \eta_t \neq 0$, where $\bar{L} := \frac{1}{n} \sum_{i\in [n]} L_i$ and […] where $\mathbb{E}$ denotes the total expectation, defined by $\mathbb{E} := \mathbb{E}_{\bm{\xi}_0} \cdots$ […]

Figures (21)

  • Figure 5: (a) Decaying learning rates (constant, diminishing, cosine, linear, and polynomial) with a constant batch size, (b) full gradient norm of the empirical loss, (c) empirical loss value, and (d) test accuracy for SGD training ResNet-18 on the CIFAR-100 dataset.
  • Figure 6: (a) Decaying learning rates with the batch size doubled every 30 epochs, (b) full gradient norm of the empirical loss, (c) empirical loss value, and (d) test accuracy for SGD training ResNet-18 on the CIFAR-100 dataset.
  • Figure 11: (a) Increasing learning rates ($\eta_{\max} = 0.2, 0.5, 1.0$) with the batch size doubled every 30 epochs, (b) full gradient norm of the empirical loss, (c) empirical loss value, and (d) test accuracy for SGD training ResNet-18 on the CIFAR-100 dataset.
  • Figure 16: (a) Warm-up learning rates with the batch size doubled every 30 epochs, (b) full gradient norm of the empirical loss, (c) empirical loss value, and (d) test accuracy for SGD training ResNet-18 on the CIFAR-100 dataset.
  • Figure 21: (a) Increasing learning rates and increasing batch sizes based on $\delta = 2, 3, 4$, (b) full gradient norm of the empirical loss, (c) empirical loss value, and (d) test accuracy for SGD training ResNet-18 on the CIFAR-100 dataset.
  • ...and 16 more figures
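
Panel (b) of each figure reports the full gradient norm of the empirical loss. The toy sketch below shows one way such a quantity might be tracked during mini-batch SGD. It assumes a one-dimensional quadratic loss $f(\theta) = \frac{1}{n}\sum_i \frac{1}{2}(\theta - a_i)^2$ instead of ResNet-18 on CIFAR-100, and all names (`run_sgd`, `full_grad`) and settings are hypothetical. It is not the paper's experimental code, and no particular outcome is implied.

```python
import random

def full_grad(theta, data):
    """Full gradient of f(theta) = (1/n) * sum_i 0.5 * (theta - a_i)^2."""
    return theta - sum(data) / len(data)

def run_sgd(data, batch_fn, lr_fn, epochs=8, theta0=5.0, seed=0):
    """Run mini-batch SGD and return |full gradient| recorded after every epoch."""
    rng = random.Random(seed)
    theta, norms = theta0, []
    for epoch in range(epochs):
        b = min(batch_fn(epoch), len(data))   # batch size for this epoch
        eta = lr_fn(epoch)                    # learning rate for this epoch
        order = data[:]
        rng.shuffle(order)                    # one pass over shuffled data
        for start in range(0, len(order), b):
            batch = order[start:start + b]
            grad = sum(theta - a for a in batch) / len(batch)
            theta -= eta * grad
        norms.append(abs(full_grad(theta, data)))
    return norms

if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(0.0, 1.0) for _ in range(256)]
    # Constant batch size and learning rate versus both increasing every 2 epochs.
    const = run_sgd(data, lambda e: 8, lambda e: 0.1)
    grow = run_sgd(data, lambda e: 8 * 2 ** (e // 2), lambda e: 0.1 * 1.2 ** (e // 2))
    for e, (c, g) in enumerate(zip(const, grow)):
        print(f"epoch {e}: constant={c:.4f} increasing={g:.4f}")
```

Note that the full gradient is evaluated over the whole dataset after each epoch, separately from the mini-batch gradients used for the updates.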

Theorems & Definitions (9)

  • Lemma 2.1
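  • Theorem 3.1: Upper bound on $\min_t \mathbb{E}\|\nabla f(\bm{\theta}_t)\|^2$ for SGD using scheduler (i) (constant batch size, decaying learning rate)
  • Theorem 3.2: Convergence rate of SGD using scheduler (ii) (increasing batch size, decaying learning rate)
  • Theorem 3.3: Convergence rate of SGD using scheduler (iii) (increasing batch size, increasing learning rate)
  • Theorem 3.4: Convergence rate of SGD using scheduler (iv) (increasing batch size, warm-up decaying learning rate)
  • Proposition A.1
  • Theorem A.1: Convergence rate of SGD using scheduler (ii) (increasing batch size, decaying learning rate)
  • Theorem A.2: Convergence rate of SGD using scheduler (iii) (increasing batch size, increasing learning rate)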
  • Lemma A.1
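
Bounds of this kind for mini-batch SGD are commonly governed by two schedule-dependent ratios, $1 / \sum_t \eta_t$ and $\sum_t (\eta_t^2 / b_t) / \sum_t \eta_t$. The sketch below evaluates them numerically for a given schedule. That these ratios correspond, up to constants, to the terms in the theorems listed above is an assumption made here, since the statements are not reproduced in this summary. The function name `bound_terms` and all numeric values are hypothetical.

```python
def bound_terms(etas, batches):
    """Return (1 / sum eta_t, sum(eta_t^2 / b_t) / sum eta_t) for one schedule.

    Constants multiplying these ratios in an actual bound are omitted.
    """
    total = sum(etas)
    if total == 0:
        raise ValueError("the learning rates must not sum to zero")
    first = 1.0 / total
    second = sum(e * e / b for e, b in zip(etas, batches)) / total
    return first, second

if __name__ == "__main__":
    steps, stage = 100, 25
    # Constant learning rate and batch size: the second ratio stays at eta / b.
    const = bound_terms([0.1] * steps, [8] * steps)
    # Learning rate times 1.2 and batch size times 2 at every stage boundary.
    etas = [0.1 * 1.2 ** (t // stage) for t in range(steps)]
    batches = [8 * 2 ** (t // stage) for t in range(steps)]
    grow = bound_terms(etas, batches)
    print("constant  :", const)
    print("increasing:", grow)
```

With a constant learning rate and batch size, the second ratio equals $\eta / b$ for every horizon, so it cannot shrink as training continues. Growing the batch size reduces it, and growing the learning rate at the same time also reduces the first ratio.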