Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

Shuaipeng Li; Penghao Zhao; Hailin Zhang; Xingwu Sun; Hao Wu; Dian Jiao; Weiyan Wang; Chengjun Liu; Zheng Fang; Jinbao Xue; Yangyu Tao; Bin Cui; Di Wang

TL;DR

This work reveals a non-monotone scaling law between batch size and the optimal learning rate for Adam-style optimizers, showing a surge that peaks at a data-noise balance point $B_{peak}=B_{noise}$ and shifts to larger batch sizes as training progresses. The authors derive the LR optimality condition under sign-based updates with Gaussian gradient estimates, and they validate the theory through extensive CV/NLP experiments, demonstrating the practical value of adaptive batch-size and LR strategies. The findings challenge SGD-inspired linear or square-root scaling for Adam and highlight a principled trade-off between training speed and data efficiency. Overall, the work provides a theoretical framework and empirical evidence to guide hyperparameter tuning for large-scale, sign-based optimizers.
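To make the "sign-based updates with Gaussian gradient estimates" setting concrete, here is a minimal sketch. It assumes a single gradient coordinate whose mini-batch estimate is Gaussian with mean `mu` and per-sample standard deviation `sigma`, so that a batch of size `B` has standard deviation `sigma / sqrt(B)`. The names `mu`, `sigma`, `expected_sign`, and `monte_carlo_sign` are illustrative and not taken from the paper. Under that assumption, the expected sign of the estimate is `erf(mu * sqrt(B) / (sigma * sqrt(2)))`, which the code compares against a Monte Carlo average.

```python
import math
import random


def expected_sign(mu: float, sigma: float, batch_size: int) -> float:
    """E[sign(g)] for g ~ N(mu, sigma^2 / batch_size) (assumed noise model)."""
    return math.erf(mu * math.sqrt(batch_size) / (sigma * math.sqrt(2.0)))


def monte_carlo_sign(mu: float, sigma: float, batch_size: int,
                     trials: int = 200_000, seed: int = 0) -> float:
    """Empirical mean of sign(g) over simulated mini-batch gradient estimates."""
    rng = random.Random(seed)
    std = sigma / math.sqrt(batch_size)  # noise shrinks as 1/sqrt(B)
    total = 0
    for _ in range(trials):
        g = rng.gauss(mu, std)
        total += 1 if g > 0 else -1
    return total / trials


if __name__ == "__main__":
    # Hypothetical values: weak true gradient, unit per-sample noise.
    for b in (1, 16, 256, 4096):
        print(b, expected_sign(0.1, 1.0, b), monte_carlo_sign(0.1, 1.0, b))
```

For small batches the expected sign grows roughly like $\sqrt{B}$ (since $\mathrm{erf}(x) \approx 2x/\sqrt{\pi}$ near zero), and for large batches it saturates at $\pm 1$. This saturation is one intuition for why a linear learning-rate rule borrowed from SGD need not carry over to sign-based updates.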

Abstract

In current deep learning tasks, Adam-style optimizers such as Adam, Adagrad, RMSProp, Adafactor, and Lion are widely used as alternatives to SGD-style optimizers. These optimizers typically update model parameters using the sign of the gradients, which results in more stable convergence curves. The learning rate and the batch size are the most critical optimizer hyperparameters and require careful tuning for effective convergence. Previous research has shown that, for SGD-style optimizers, the optimal learning rate increases linearly with batch size or follows a similar rule. However, this conclusion does not apply to Adam-style optimizers. In this paper, we elucidate the connection between optimal learning rates and batch sizes for Adam-style optimizers through both theoretical analysis and extensive experiments. First, we derive the scaling law between batch sizes and optimal learning rates in the sign-of-gradient case, and prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak of this surge gradually moves toward larger batch sizes as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law.
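The rise-then-fall behaviour can be illustrated with a small sketch. The functional forms below are assumptions chosen for illustration and are not quoted from the text above: `adam_style_lr` uses a form symmetric in $\sqrt{B/B_{noise}}$ that peaks at $B = B_{noise}$, matching the balance point named in the TL;DR, and `sgd_style_lr` uses a saturating form of the kind associated with SGD large-batch analyses. The constants `lr_max` and `b_noise` are hypothetical.

```python
import math


def adam_style_lr(batch_size: float, b_noise: float, lr_max: float) -> float:
    """Assumed surge-shaped rule: rises for B < b_noise, falls for B > b_noise."""
    ratio = math.sqrt(batch_size / b_noise)
    return lr_max / (0.5 * (ratio + 1.0 / ratio))


def sgd_style_lr(batch_size: float, b_noise: float, lr_max: float) -> float:
    """Assumed SGD-style rule: increases monotonically and saturates at lr_max."""
    return lr_max / (1.0 + b_noise / batch_size)


def peak_batch_size(batch_sizes, b_noise, lr_max):
    """Batch size on the grid with the largest Adam-style learning rate."""
    return max(batch_sizes, key=lambda b: adam_style_lr(b, b_noise, lr_max))


if __name__ == "__main__":
    grid = [2 ** k for k in range(15)]  # 1, 2, ..., 16384
    b_noise, lr_max = 256, 1e-3         # hypothetical values
    for b in grid:
        print(f"B={b:6d}  adam={adam_style_lr(b, b_noise, lr_max):.2e}  "
              f"sgd={sgd_style_lr(b, b_noise, lr_max):.2e}")
    print("peak at B =", peak_batch_size(grid, b_noise, lr_max))
```

Note that this assumed Adam-style form decays toward zero at very large batch sizes, whereas the Figure 1 caption describes the optimal learning rate converging to a non-zero value there. The sketch is therefore only meant to capture the shape around the peak.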

Paper Structure (20 sections, 5 theorems, 46 equations, 8 figures, 1 table)

Introduction
Theorems
Batch Size and Optimal Learning Rate
Data/Time Efficiency Trade-off
Summary
Experiments
Experimental Setup
Variable Estimation
Results
Discussion
Related Work
Conclusion
Parameter Update Amount in the Adam Optimizer
Proof of Lemma \ref{theorem:opt:grad}
Proof of Theorem \ref{theorem:optlr:gaussian}
...and 5 more sections

Key Result

Lemma 1

Suppose that we are updating the parameter $\theta$ using the mini-batch gradient $V$, with the true gradient being $G$ and the true Hessian being $H$. Then the optimal learning rate that maximizes the expected decrease in loss, and the corresponding loss improvement $\Delta L$, can both be expressed in closed form in terms of $G$, $H$, and the statistics of $V$.
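As a sketch of how a result of this shape arises, assume a second-order Taylor expansion of the loss around $\theta$ and an update $\theta \leftarrow \theta - \epsilon V$. Both are assumptions made here for illustration, and the symbols $\epsilon$, $\epsilon_{opt}$, and $\Delta L_{opt}$ are introduced only for this derivation. Maximizing the expected improvement over $\epsilon$ then gives:

```latex
\begin{align}
L(\theta - \epsilon V) &\approx L(\theta) - \epsilon\, G^{T} V
    + \tfrac{1}{2}\,\epsilon^{2}\, V^{T} H V, \\
\mathbb{E}[\Delta L](\epsilon) &= \mathbb{E}\big[L(\theta) - L(\theta - \epsilon V)\big]
    \approx \epsilon\, G^{T}\mathbb{E}[V]
    - \tfrac{1}{2}\,\epsilon^{2}\, \mathbb{E}\big[V^{T} H V\big], \\
\frac{\partial\, \mathbb{E}[\Delta L]}{\partial \epsilon} = 0
    \;&\Longrightarrow\;
    \epsilon_{opt} = \frac{G^{T}\mathbb{E}[V]}{\mathbb{E}\big[V^{T} H V\big]}, \\
\Delta L_{opt} &= \mathbb{E}[\Delta L](\epsilon_{opt})
    = \frac{\big(G^{T}\mathbb{E}[V]\big)^{2}}{2\,\mathbb{E}\big[V^{T} H V\big]}.
\end{align}
```

The maximum exists when $\mathbb{E}[V^{T} H V] > 0$. The derivation does not depend on how $V$ is formed, so the same steps apply whether $V$ is the raw mini-batch gradient or, as in the sign-based setting discussed here, its element-wise sign; only $\mathbb{E}[V]$ and $\mathbb{E}[V^{T} H V]$ change.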

Figures (8)

Figure 1: The relationship between the optimal learning rate and the batch size is different between Adam and SGD. The orange line represents the tendency of the optimal learning rate to converge to a non-zero value when the batch size is large enough.
Figure 2: Batch size versus optimal learning rate within the context of CNN trained on FashionMNIST.
Figure 3: The relationship between batch sizes and optimal learning rates within the context of ResNet-18 trained on TinyImageNet. The red dashed line accurately predicts the peak value, and as the training loss decreases, the peak value gradually shifts to the right.
Figure 4: The relationship between batch sizes and optimal learning rates within the context of DistilGPT2 trained on Eli5Category.
Figure 5: Grid search results for the MoE-structured model [jiang2024casas, dai2024deepseekmoe].
...and 3 more figures

Theorems & Definitions (10)

Lemma 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
proof
proof
proof
proof
proof
