Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis
Yuichi Kondo, Hideaki Iiduka
TL;DR
The paper analyzes the convergence of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel Lyapunov function. It develops a unified framework that covers both the mini-batch stochastic heavy ball (SHB) method and its normalized variant (NSHB), and it establishes convergence bounds on the expected gradient norm for three scheduling strategies: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. The key finding is a hierarchy: a constant batch size may fail to guarantee convergence, an increasing batch size ensures convergence, and increasing both the batch size and the learning rate can yield faster decay (with exponential rates under certain schedules). Empirical results on CIFAR-100 with ResNet-18 validate the theory, showing that dynamically scheduled SGDM markedly accelerates convergence, with warm-up strategies often performing best and the observed ordering of gradient norms across schedules matching the predicted one.
Abstract
We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules by introducing a novel and simpler Lyapunov function. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning: a constant batch size with a decaying learning rate, an increasing batch size with a decaying learning rate, and an increasing batch size with an increasing learning rate. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and the learning rate achieves a provably faster decay. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms its fixed-hyperparameter counterpart in convergence speed. We also evaluate a warm-up schedule in our experiments, where it empirically outperforms all other strategies in convergence behavior.
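To make the three scheduling strategies concrete, the following is a minimal sketch rather than the authors' implementation. It runs SGDM on the toy objective f(x) = 0.5 x^2 with Gaussian gradient noise, so that |x| equals the gradient norm. The function names, the step-wise batch growth (doubling every 2 epochs), the 1/sqrt(epoch + 1) decay, the growth factor 1.2, and all default values are assumptions chosen for illustration; the paper's exact schedules and constants are not given in this section. The momentum update uses the standard heavy-ball form m = beta * m + g, with the extra (1 - beta) weight on the gradient assumed for the normalized variant.

```python
import random


def batch_size(epoch, b0=8, factor=2, every=2, increasing=True):
    """Constant batch size, or one multiplied by `factor` every `every` epochs (hypothetical schedule)."""
    return b0 * factor ** (epoch // every) if increasing else b0


def learning_rate(epoch, lr0=0.1, mode="decay", gamma=1.2, every=2):
    """Learning rate at a given epoch; all three modes are illustrative assumptions."""
    if mode == "decay":
        return lr0 / (epoch + 1) ** 0.5          # decays like 1/sqrt(epoch + 1)
    if mode == "increase":
        return lr0 * gamma ** (epoch // every)   # grows by gamma every `every` epochs
    return lr0                                   # constant


def sgdm_step(x, m, grad, lr, beta=0.9, normalized=False):
    """One momentum step: heavy-ball form by default, (1 - beta)-weighted gradient if normalized."""
    scale = (1.0 - beta) if normalized else 1.0
    m = beta * m + scale * grad
    return x - lr * m, m


def minibatch_grad(x, b, rng, noise=1.0):
    """Gradient of f(x) = 0.5 * x**2 plus the average of b Gaussian noise samples."""
    return x + sum(rng.gauss(0.0, noise) for _ in range(b)) / b


def run(increasing_batch, lr_mode, epochs=10, steps_per_epoch=20, seed=0):
    """Run scheduled SGDM and return the final gradient norm |x|."""
    rng = random.Random(seed)
    x, m = 5.0, 0.0
    for epoch in range(epochs):
        b = batch_size(epoch, increasing=increasing_batch)
        lr = learning_rate(epoch, mode=lr_mode)
        for _ in range(steps_per_epoch):
            g = minibatch_grad(x, b, rng)
            x, m = sgdm_step(x, m, g, lr)
    return abs(x)


if __name__ == "__main__":
    print("constant batch, decaying LR    :", run(False, "decay"))
    print("increasing batch, decaying LR  :", run(True, "decay"))
    print("increasing batch, increasing LR:", run(True, "increase"))
```

A single seed on a one-dimensional problem is noisy, so the printed values need not reproduce the ordering the theory predicts; averaging over many seeds would give a fairer comparison of the three strategies.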
