Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent
Hikaru Umeda, Hideaki Iiduka
TL;DR
The paper investigates how jointly scheduling the batch size and learning rate (LR) affects mini-batch stochastic gradient descent in nonconvex empirical risk minimization (ERM), deriving a unified bound on the best gradient norm across iterations. It analyzes four practical schedulers: constant batch size with decaying LR, increasing batch size with decaying LR, increasing batch size with increasing LR, and increasing batch size with warm-up decaying LR. It proves that increasing both batch size and LR can accelerate convergence, with rates tied to $B_T$ and $V_T$. The theoretical results are complemented by numerical experiments on CNNs, in which increasing the batch size with a warm-up LR yields the fastest convergence, consistent with the analysis. The work highlights practical implications for training deep networks efficiently and lays groundwork for broader empirical validation across architectures and datasets.
Abstract
The performance of mini-batch stochastic gradient descent (SGD) strongly depends on the batch size and learning rate chosen to minimize the empirical loss when training a deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate, (ii) increasing batch size and decaying learning rate, (iii) increasing batch size and increasing learning rate, and (iv) increasing batch size and warm-up decaying learning rate. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does when using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. We also provide numerical results supporting the analyses, showing that scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than scheduler (i) or (ii).
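
To make the four schedulers concrete, the following is a minimal sketch of how each one might map an epoch index to a (batch size, learning rate) pair. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the cosine form of the decay, the linear warm-up, and all numeric values (initial batch size 8, doubling every 20 epochs, LR growth factor 1.2, and so on).

```python
import math


def batch_size(epoch, b0=8, growth=2, every=20, increasing=True):
    """Constant batch size, or one multiplied by `growth` every `every` epochs."""
    if not increasing:
        return b0
    return b0 * growth ** (epoch // every)


def lr_decay(epoch, total, lr_max=0.1, lr_min=0.001):
    """Cosine decay from lr_max (epoch 0) to lr_min (epoch == total)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total))


def lr_increase(epoch, lr0=0.1, growth=1.2, every=20):
    """Learning rate multiplied by `growth` every `every` epochs."""
    return lr0 * growth ** (epoch // every)


def lr_warmup_decay(epoch, total, warmup=20, lr0=0.01, lr_max=0.1, lr_min=0.001):
    """Linear warm-up from lr0 to lr_max, then cosine decay down to lr_min."""
    if epoch < warmup:
        return lr0 + (lr_max - lr0) * epoch / warmup
    return lr_decay(epoch - warmup, total - warmup, lr_max, lr_min)


def schedule(kind, epoch, total=100):
    """Return (batch size, learning rate) for scheduler 'i', 'ii', 'iii', or 'iv'."""
    if kind == "i":    # constant batch size, decaying LR
        return batch_size(epoch, increasing=False), lr_decay(epoch, total)
    if kind == "ii":   # increasing batch size, decaying LR
        return batch_size(epoch), lr_decay(epoch, total)
    if kind == "iii":  # increasing batch size, increasing LR
        return batch_size(epoch), lr_increase(epoch)
    if kind == "iv":   # increasing batch size, warm-up then decaying LR
        return batch_size(epoch), lr_warmup_decay(epoch, total)
    raise ValueError(f"unknown scheduler: {kind!r}")


if __name__ == "__main__":
    # Print each scheduler's (batch size, LR) at a few sample epochs.
    for kind in ("i", "ii", "iii", "iv"):
        for epoch in (0, 20, 40, 60, 80):
            b, lr = schedule(kind, epoch)
            print(f"scheduler ({kind}) epoch {epoch:3d}: batch={b:4d} lr={lr:.4f}")
```

In a training loop, one would typically rebuild the data loader with the new batch size and set the optimizer's learning rate at the start of each epoch. The concrete schedule shapes and hyperparameters the authors analyze and test should be taken from the paper itself.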
