A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes

Zachary Nado, Justin M. Gilmer, Christopher J. Shallue, Rohan Anil, George E. Dahl

TL;DR

This work challenges the need for specialized large-batch optimizers (LARS and LAMB) by showing that standard methods such as Nesterov momentum and Adam can match or exceed their results on ImageNet and BERT pretraining when hyperparameters and regularization are carefully tuned. It establishes strong baselines for large-batch training, shows how strongly results depend on the learning-rate schedule and on batch normalization and regularization choices, and demonstrates that improvements can come from tuning rather than from novel update rules. The authors call for rigorous, standardized comparisons and transparent reporting of tuning effort: any claimed optimizer advantage should be demonstrated against well-tuned, fair baselines, with practical impact and reproducibility weighted above novelty alone.

Abstract

Recently the LARS and LAMB optimizers have been proposed for training neural networks faster using large batch sizes. LARS and LAMB add layer-wise normalization to the update rules of Heavy-ball momentum and Adam, respectively, and have become popular in prominent benchmarks and deep learning libraries. However, without fair comparisons to standard optimizers, it remains an open question whether LARS and LAMB have any benefit over traditional, generic algorithms. In this work we demonstrate that standard optimization algorithms such as Nesterov momentum and Adam can match or exceed the results of LARS and LAMB at large batch sizes. Our results establish new, stronger baselines for future comparisons at these batch sizes and shed light on the difficulties of comparing optimizers for neural network training more generally.
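
To make "layer-wise normalization" concrete, the following is a minimal sketch of how a per-layer rescaling can be added to a heavy-ball momentum step. It is an illustration only, not the reference LARS implementation: it omits weight decay and the trust coefficient, and the function names (`heavy_ball_step`, `layerwise_step`) and the `eps` guard are assumptions introduced here.

```python
import math


def l2_norm(xs):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in xs))


def heavy_ball_step(w, g, v, lr, momentum=0.9):
    """Plain heavy-ball momentum for one layer (w, g, v are lists of floats)."""
    v_new = [momentum * vi + lr * gi for vi, gi in zip(v, g)]
    w_new = [wi - vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


def layerwise_step(w, g, v, lr, momentum=0.9, eps=1e-9):
    """Heavy-ball step with the layer's gradient rescaled by ||w|| / ||g||.

    Hypothetical simplification: no weight decay, no trust coefficient.
    """
    w_norm, g_norm = l2_norm(w), l2_norm(g)
    # Fall back to a ratio of 1.0 when either norm is zero, so the step
    # degrades to plain heavy-ball momentum instead of dividing by zero.
    if w_norm > 0.0 and g_norm > 0.0:
        trust_ratio = w_norm / (g_norm + eps)
    else:
        trust_ratio = 1.0
    scaled = [trust_ratio * gi for gi in g]
    return heavy_ball_step(w, scaled, v, lr, momentum)
```

In this sketch the size of a layer's update depends on the norm of its weights rather than on the raw gradient magnitude, so multiplying the gradient by a constant leaves the step essentially unchanged. The generic optimizers studied in the paper have no such per-layer rescaling.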

Paper Structure

This paper contains 24 sections, 16 equations, 4 figures, 23 tables.

Figures (4)

  • Figure 1: The learning rate schedules of LARS and Nesterov momentum Configuration B. Aside from re-scaling, the only difference is setting the warmup polynomial power to 2 instead of 1.
  • Figure 2: An illustration of the sudden drop in the BERT learning rate schedule in the official codebase.
  • Figure 3: Six finetuning runs starting from the same pretraining checkpoint, showing the stability of our results at each of the 32,768, mixed 65,536-32,768, and 65,536 batch size settings.
  • Figure 4: Distributions over 50 training runs for each ablation study around our best Nesterov momentum configuration (Configuration A). The dotted red line is at the target accuracy of 75.9%, and the boxes show the min, max, and quartiles of the distribution of accuracies over the 50 training runs.
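
The warmup difference described in the Figure 1 caption (polynomial power 2 instead of 1) can be sketched as follows. This is a rough illustration of the warmup phase only: the function name `warmup_lr`, its parameters, and the choice to hold the rate at its peak after warmup are assumptions made here, and the decay phase used in the paper is not modeled.

```python
def warmup_lr(step, warmup_steps, peak_lr, power=1.0):
    """Polynomial warmup: lr rises from 0 to peak_lr over warmup_steps.

    power=1 gives a linear ramp; power=2 gives a quadratic ramp that
    stays lower early on and catches up near the end of warmup.
    After warmup this sketch simply returns peak_lr (an assumption;
    the schedule's decay phase is not modeled).
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    return peak_lr * (step / warmup_steps) ** power


# Halfway through warmup, the quadratic ramp is at a quarter of the peak
# while the linear ramp is at half of it.
linear_mid = warmup_lr(50, 100, peak_lr=1.0, power=1.0)
quadratic_mid = warmup_lr(50, 100, peak_lr=1.0, power=2.0)
```

Both ramps reach the same peak at the end of warmup; the higher power only changes how quickly the learning rate grows in the earliest steps.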
