Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent

Hikaru Umeda; Hideaki Iiduka

TL;DR

The paper investigates how joint scheduling of batch size and learning rate affects stochastic gradient descent in nonconvex ERM, deriving a unified bound on the best gradient norm across iterations. It analyzes four practical schedulers: constant batch with decaying LR, increasing batch with decaying LR, increasing batch with increasing LR, and increasing batch with warm-up decaying LR, and proves that increasing both batch size and LR can accelerate convergence with rates tied to $B_T$ and $V_T$. Theoretical results are complemented by numerical experiments on CNNs showing that increasing batch size with warm-up yields the fastest convergence, broadly validating the proposed approach. The work highlights practical implications for training deep networks efficiently and lays groundwork for broader empirical validation across architectures and datasets.

Abstract

The performance of mini-batch stochastic gradient descent (SGD) strongly depends on how the batch size and learning rate are set to minimize the empirical loss when training a deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate scheduler, (ii) increasing batch size and decaying learning rate scheduler, (iii) increasing batch size and increasing learning rate scheduler, and (iv) increasing batch size and warm-up decaying learning rate scheduler. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results that support the analyses, showing that using scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than using scheduler (i) or (ii).
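The four scheduler families named in the abstract can be sketched as plain functions of the epoch. This is a minimal illustration, not the authors' implementation: the function names, the cosine form of the decay, the geometric growth factors `delta` and `gamma`, the 30-epoch period, and all default values are assumptions chosen for readability.

```python
import math

def batch_size(epoch, b0=8, delta=2, period=30, increasing=True):
    """Constant batch size, or one multiplied by `delta` every `period` epochs."""
    return b0 * delta ** (epoch // period) if increasing else b0

def cosine_decay_lr(epoch, total_epochs, eta_max=0.1, eta_min=0.001):
    """One possible decaying learning rate (cosine annealing); an assumed choice."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (eta_max - eta_min) * cos_term

def increasing_lr(epoch, eta0=0.1, gamma=1.2, period=30):
    """Learning rate multiplied by `gamma` every `period` epochs (hypothetical values)."""
    return eta0 * gamma ** (epoch // period)

def warmup_decay_lr(epoch, total_epochs, warmup_epochs=90, eta0=0.1,
                    gamma=1.2, period=30, eta_min=0.001):
    """Increase the learning rate during warm-up, then decay it with a cosine curve."""
    if epoch < warmup_epochs:
        return increasing_lr(epoch, eta0, gamma, period)
    eta_peak = increasing_lr(warmup_epochs - 1, eta0, gamma, period)
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return eta_min + 0.5 * (eta_peak - eta_min) * (1.0 + math.cos(math.pi * progress))

def schedule(kind, epoch, total_epochs=300):
    """Return (batch size, learning rate) for scheduler kind 'i', 'ii', 'iii', or 'iv'."""
    if kind == "i":    # constant batch size, decaying learning rate
        return batch_size(epoch, increasing=False), cosine_decay_lr(epoch, total_epochs)
    if kind == "ii":   # increasing batch size, decaying learning rate
        return batch_size(epoch), cosine_decay_lr(epoch, total_epochs)
    if kind == "iii":  # increasing batch size, increasing learning rate
        return batch_size(epoch), increasing_lr(epoch)
    if kind == "iv":   # increasing batch size, warm-up then decaying learning rate
        return batch_size(epoch), warmup_decay_lr(epoch, total_epochs)
    raise ValueError(f"unknown scheduler kind: {kind!r}")
```

Calling `schedule("iii", 60)` under these assumed defaults returns a batch size that has doubled twice and a learning rate that has grown by `gamma` twice; the specific growth factors and periods used in the paper should be taken from the paper itself.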

Paper Structure (24 sections, 9 theorems, 125 equations, 21 figures, 2 tables, 1 algorithm)

Introduction
Mini-batch SGD for Empirical Risk Minimization
Empirical Risk Minimization
Mini-batch SGD
Convergence Analysis of Mini-batch SGD
Constant Batch Size and Decaying Learning Rate Scheduler
Increasing Batch Size and Decaying Learning Rate Scheduler
Increasing Batch Size and Increasing Learning Rate Scheduler
Increasing Batch Size and Warm-up Decaying Learning Rate Scheduler
Comparisons of Our Convergence Rate Results with Existing Ones
Comparisons of Convergence Rates under Nonconvexity with Ones under Convexity
Numerical Results
Conclusion
Appendix
Example of Stochastic Gradient satisfying (A2) under (A1)
...and 9 more sections

Key Result

Lemma 2.1

Suppose that Assumption assum:1 holds and consider the sequence $(\bm{\theta}_t)$ generated by Algorithm algo:1 with $\eta_t \in [\eta_{\min}, \eta_{\max}] \subset [0, \frac{2}{\bar{L}})$ satisfying $\sum_{t=0}^{T-1} \eta_t \neq 0$, where $\bar{L} := \frac{1}{n} \sum_{i\in [n]} L_i$ and […] where $\mathbb{E}$ denotes the total expectation, defined by $\mathbb{E} := \mathbb{E}_{\bm{\xi}_0} \cdots$ […]

Figures (21)

Figure 5: (a) Decaying learning rates (constant, diminishing, cosine, linear, and polynomial) and constant batch size, (b) full gradient norm of empirical loss, (c) empirical loss value, and (d) accuracy score in testing for SGD to train ResNet-18 on CIFAR100 dataset.
Figure 6: (a) Decaying learning rates and doubly increasing batch size every 30 epochs, (b) full gradient norm of empirical loss, (c) empirical loss value, and (d) accuracy score in testing for SGD to train ResNet-18 on CIFAR100 dataset.
Figure 11: (a) Increasing learning rates ($\eta_{\max} = 0.2, 0.5, 1.0$) and doubly increasing batch size every 30 epochs, (b) full gradient norm of empirical loss, (c) empirical loss value, and (d) accuracy score in testing for SGD to train ResNet-18 on CIFAR100 dataset.
Figure 16: (a) Warm-up learning rates and doubly increasing batch size every 30 epochs, (b) full gradient norm of empirical loss, (c) empirical loss value, and (d) accuracy score in testing for SGD to train ResNet-18 on CIFAR100 dataset.
Figure 21: (a) Increasing learning rates and increasing batch sizes based on $\delta = 2,3,4$, (b) full gradient norm of empirical loss, (c) empirical loss value, and (d) accuracy score in testing for SGD to train ResNet-18 on CIFAR100 dataset.
...and 16 more figures
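The captions above describe batch sizes that double every 30 epochs, or grow by a factor $\delta = 2, 3, 4$. One practical consequence is that the number of SGD steps per epoch shrinks as the batch grows. The sketch below works through that arithmetic for the 50,000-image CIFAR-100 training set; the initial batch size `b0 = 8` and the helper names are hypothetical and are not taken from the paper.

```python
import math

N_TRAIN = 50_000  # size of the CIFAR-100 training set

def steps_per_epoch(epoch, b0=8, delta=2, period=30, n=N_TRAIN):
    """Return (batch size, number of SGD steps) for the given epoch.

    The batch size is multiplied by `delta` every `period` epochs and is
    capped at the dataset size `n`. `b0` is an assumed starting value.
    """
    b = min(n, b0 * delta ** (epoch // period))
    return b, math.ceil(n / b)

def total_steps(epochs, **kwargs):
    """Total number of SGD steps taken over `epochs` epochs."""
    return sum(steps_per_epoch(e, **kwargs)[1] for e in range(epochs))

if __name__ == "__main__":
    for delta in (2, 3, 4):
        stages = [steps_per_epoch(e, delta=delta) for e in (0, 30, 60)]
        print(f"delta={delta}: (batch, steps/epoch) per stage = {stages}, "
              f"total steps in 90 epochs = {total_steps(90, delta=delta)}")
```

Larger growth factors reduce the total step count for a fixed number of epochs, which is one reason per-step and per-epoch comparisons of the schedulers can look different.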

Theorems & Definitions (9)

Lemma 2.1
Theorem 3.1: Upper bound on $\min_t \mathbb{E}\|\nabla f(\bm{\theta}_t)\|^2$ for SGD using scheduler (i)
Theorem 3.2: Convergence rate of SGD using scheduler (ii)
Theorem 3.3: Convergence rate of SGD using scheduler (iii)
Theorem 3.4: Convergence rate of SGD using scheduler (iv)
Proposition A.1
Theorem A.1: Convergence rate of SGD using scheduler (ii)
Theorem A.2: Convergence rate of SGD using scheduler (iii)
Lemma A.1
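
The quantity bounded in these results, $\min_t \mathbb{E}\|\nabla f(\bm{\theta}_t)\|^2$, can be tracked empirically on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's experiment: it runs mini-batch SGD on a one-dimensional least-squares loss with $f_i(\theta) = \frac{1}{2}(a_i\theta - y_i)^2$, so that $L_i = a_i^2$, grows both the batch size and the learning rate geometrically per stage of iterations, and keeps the learning rate below $2/\bar{L}$ as in Lemma 2.1. The data, the stage length, and the factors `delta` and `gamma` are all hypothetical, and a single run gives one sample rather than the expectation.

```python
import random

def run_sgd(steps=200, b0=4, delta=2, eta0=0.1, gamma=1.2, period=50, seed=0):
    """Mini-batch SGD with increasing batch size and learning rate on a toy loss.

    Returns the final parameter and the smallest squared full-gradient norm seen.
    """
    rng = random.Random(seed)
    n = 200
    a = [rng.uniform(0.5, 1.5) for _ in range(n)]
    y = [2.0 * ai + rng.gauss(0.0, 0.1) for ai in a]  # minimizer is close to 2
    L_bar = sum(ai * ai for ai in a) / n              # average smoothness constant

    def full_grad(theta):
        # Gradient of the empirical loss f = (1/n) * sum_i f_i.
        return sum(ai * (ai * theta - yi) for ai, yi in zip(a, y)) / n

    theta = 0.0
    best = full_grad(theta) ** 2
    for t in range(steps):
        stage = t // period
        b = min(n, b0 * delta ** stage)                       # increasing batch size
        eta = min(eta0 * gamma ** stage, 0.9 * 2.0 / L_bar)   # stay below 2 / L_bar
        batch = [rng.randrange(n) for _ in range(b)]          # sample with replacement
        g = sum(a[i] * (a[i] * theta - y[i]) for i in batch) / b
        theta -= eta * g
        best = min(best, full_grad(theta) ** 2)
    return theta, best

if __name__ == "__main__":
    theta, best = run_sgd()
    print(f"final theta = {theta:.4f}, min squared full-gradient norm = {best:.3e}")
```

Setting `delta=1` and `gamma=1` in the same function gives a constant-batch, constant-learning-rate baseline for comparison on this toy problem.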
