Revisiting LARS for Large Batch Training Generalization of Neural Networks

Khoi Do, Duong Nguyen, Hoa Nguyen, Long Tran-Thanh, Nguyen-Hoang Tran, Quoc-Viet Pham

TL;DR

The paper tackles instability in large-batch training caused by sharp minimizers and high Layer-wise Normalized Rate (LNR) in LARS/LAMB. It introduces Time-Varying LARS (TVLARS), which replaces warm-up with a time-varying decay $\phi(t) = \frac{1}{\alpha + \exp(\lambda(t - d_e))} + \gamma_{\min}$ that starts with a high LR and gradually shifts toward LARS-like behavior. Through extensive experiments on CIFAR-10, Tiny ImageNet, and Barlow Twins SSL, TVLARS delivers up to 2% improvements in classification and up to 10% in SSL, while accelerating convergence relative to LARS/LAMB. The results demonstrate TVLARS as a practical, robust method for scalable, large-batch training in supervised and self-supervised settings, with clear guidance on decay, LR, and initialization effects.
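
The decay term above can be sketched in a few lines to show its shape. This is a minimal sketch: the function name `tvlars_decay` and all default values are illustrative assumptions, not taken from the paper, and only the formula itself comes from the summary. How the factor enters the layer-wise update is not shown.

```python
import math


def tvlars_decay(t, lam=1e-3, d_e=0.0, alpha=1.0, gamma_min=0.0):
    """Evaluate phi(t) = 1 / (alpha + exp(lam * (t - d_e))) + gamma_min.

    All default values are placeholders chosen for illustration only.
    """
    # Clip the exponent: math.exp raises OverflowError for arguments
    # above roughly 709 on IEEE-754 doubles.
    z = min(lam * (t - d_e), 700.0)
    return 1.0 / (alpha + math.exp(z)) + gamma_min
```

With `alpha = 1`, the value at `t = d_e` is `1/2 + gamma_min`, and for `lam > 0` the factor decreases toward `gamma_min` as `t` grows.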

Abstract

This paper explores Large Batch Training techniques using layer-wise adaptive scaling ratio (LARS) across diverse settings, uncovering insights. LARS algorithms with warm-up tend to be trapped in sharp minimizers early on due to redundant ratio scaling. Additionally, a fixed steep decline in the latter phase restricts deep neural networks from effectively navigating early-phase sharp minimizers. Building on these findings, we propose Time Varying LARS (TVLARS), a novel algorithm that replaces warm-up with a configurable sigmoid-like function for robust training in the initial phase. TVLARS promotes gradient exploration early on, moving past sharp minimizers and gradually transitioning to LARS for robustness in later phases. Extensive experiments demonstrate that TVLARS consistently outperforms LARS and LAMB in most cases, with up to 2% improvement in classification scenarios. Notably, in all self-supervised learning cases, TVLARS dominates LARS and LAMB with performance improvements of up to 10%.

Paper Structure

This paper contains 16 sections, 1 theorem, 13 equations, 8 figures, 1 table, and 1 algorithm.

Key Result

Theorem IV.2

Let $\bar{g}^t$ be as in Definition IV.1, and let $g^t_\mathcal{B}$ be the batch gradient with batch size $\mathcal{B}$. Given that $\sigma^2$ is the variance of the point-wise unbiased gradient, as in [2020-FL-FedNova], the stochastic gradient with batch size $\mathcal{B}$ is an unbiased ...
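
The statement above is truncated, so the following only illustrates the textbook property that the theorem's title points to: averaging independent, unbiased per-sample gradients gives an unbiased batch gradient whose variance shrinks as $\sigma^2 / \mathcal{B}$. It is a minimal numerical sketch, not the paper's proof. The scalar "gradient", the Gaussian noise model, and the name `batch_gradient_stats` are assumptions made for illustration.

```python
import random
import statistics


def batch_gradient_stats(true_grad=2.0, sigma=1.0, batch_size=32,
                         trials=5000, seed=0):
    """Return (mean, variance) of simulated batch gradients.

    Each per-sample gradient is drawn as true_grad + Gaussian noise with
    standard deviation sigma (an assumed noise model, for illustration).
    """
    rng = random.Random(seed)
    batch_grads = []
    for _ in range(trials):
        samples = [rng.gauss(true_grad, sigma) for _ in range(batch_size)]
        batch_grads.append(sum(samples) / batch_size)  # batch average
    return statistics.fmean(batch_grads), statistics.pvariance(batch_grads)
```

Under these assumptions the returned mean should sit close to `true_grad` and the returned variance close to `sigma**2 / batch_size`.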

Figures (8)

  • Figure 1: Scaling the learning rate in two different strategies.
  • Figure 2: Comparison between LARS trained with and without warm-up.
  • Figure 3: Quantitative performance of LARS ($B = 16\mathrm{K}$) trained with a warm-up strategy and without one (NOWA-LARS). Each figure contains four subfigures whose y-axes show the LWN $\Vert w \Vert$, the LGN $\Vert\nabla \mathcal{L}(w)\Vert$, and the LNR $\Vert w \Vert / \Vert\nabla \mathcal{L}(w)\Vert$ of all layers, and the test loss.
  • Figure 4: Illustration of gradient descent behavior from the perspective of model parameter hypersphere.
  • Figure 5: The decay plot of TVLARS algorithm under different settings.
  • ...and 3 more figures
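
The per-layer quantities named in the Figure 3 caption can be computed as follows. This is a minimal sketch that assumes each layer's weights and gradients are available as flat lists of floats. The function name `layer_norm_stats`, the dictionary layout, and the `eps` guard against division by zero are assumptions, not part of the paper.

```python
import math


def layer_norm_stats(weights, grads, eps=1e-12):
    """Compute LWN = ||w||, LGN = ||grad||, and LNR = ||w|| / ||grad|| per layer.

    weights and grads map a layer name to a flat list of floats.
    """
    stats = {}
    for name, w in weights.items():
        g = grads[name]
        lwn = math.sqrt(sum(x * x for x in w))  # L2 norm of the weights
        lgn = math.sqrt(sum(x * x for x in g))  # L2 norm of the gradient
        # eps keeps the ratio finite when a gradient is exactly zero.
        stats[name] = {"LWN": lwn, "LGN": lgn, "LNR": lwn / (lgn + eps)}
    return stats
```

For example, a layer with weights `[3.0, 4.0]` and gradient `[0.6, 0.8]` has a weight norm of 5, a gradient norm of about 1, and therefore an LNR of about 5.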

Theorems & Definitions (4)

  • Definition IV.1
  • Theorem IV.2: Unbiased Large Batch Gradient
  • proof
  • proof