No More Adam: Learning Rate Scaling at Initialization is All You Need

Minghao Xu, Lichuan Xiang, Xu Cai, Hongkai Wen

TL;DR

The paper argues that adaptive gradient methods are not strictly necessary for training large Transformer models. It introduces SGD-SaI, which scales learning rates once at initialization according to each parameter block's gradient signal-to-noise ratio (g-SNR), replacing adaptive second-order momentum with a constant preconditioner fixed at initialization. Across LLM and ViT pretraining, PEFT tasks, and CNN benchmarks, SGD-SaI matches or outperforms AdamW while using substantially less optimizer memory and taking comparable or faster update steps, and it is robust to hyperparameter choices. The result is a simple, memory-efficient alternative for training Transformers at scale.

Abstract

In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI applies learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.
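
The idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the abstract does not give the g-SNR formula, so the sketch assumes it is the gradient's L2 norm divided by the standard deviation of its entries. The names `g_snr`, `scale_lrs_at_init`, and `SGDMFixedScale` are hypothetical, and weight decay is omitted.

```python
import numpy as np


def g_snr(grad, eps=1e-8):
    """ASSUMED g-SNR: L2 norm of the gradient over the std of its entries.

    The exact definition is not stated in this summary. Note that a block
    with near-constant gradient entries gets a very large value here.
    """
    grad = np.asarray(grad, dtype=np.float64)
    return float(np.linalg.norm(grad) / (np.sqrt(grad.var()) + eps))


def scale_lrs_at_init(grads_by_block, base_lr, eps=1e-8):
    """Compute one learning rate per parameter block from first-batch gradients.

    The result is computed once and never updated afterwards.
    """
    return {name: base_lr * g_snr(g, eps) for name, g in grads_by_block.items()}


class SGDMFixedScale:
    """SGD with momentum using fixed per-block learning rates (hypothetical)."""

    def __init__(self, params, lrs, momentum=0.9):
        self.params = params      # dict: block name -> float ndarray
        self.lrs = lrs            # dict: block name -> fixed learning rate
        self.momentum = momentum
        # One state tensor per parameter (AdamW would keep two).
        self.buf = {k: np.zeros_like(v) for k, v in params.items()}

    def step(self, grads):
        for name, g in grads.items():
            self.buf[name] = self.momentum * self.buf[name] + g
            self.params[name] -= self.lrs[name] * self.buf[name]
```

In use, one would run a single forward and backward pass at initialization, pass the resulting per-block gradients to `scale_lrs_at_init`, and hand the returned dictionary to the optimizer. Because the scaling factors are constants, the only per-parameter state is the momentum buffer.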

Paper Structure

This paper contains 27 sections, 19 equations, 15 figures, 6 tables, and 6 algorithms.

Figures (15)

  • Figure 1: The chart illustrates how memory usage and optimizer step time increase with model size. It highlights the substantial memory overhead of storing optimizer states as models grow. SGD-SaI uses significantly less memory than AdamW and has the shortest optimizer step time, measured as the wall-clock time of the optimizer's step function. All statistics were measured on a single NVIDIA A100-80GB.
  • Figure 2: This graph illustrates the differences in local gain behaviours exhibited by four optimizers throughout the training process. We present two popular adaptive gradient methods: Adam(W) and the memory-efficient Adam-mini. The local gains for these methods are recalculated continuously at each step based on the gradients. In contrast, SGD and SGD-SaI are both non-adaptive methods, meaning their local gains remain fixed throughout the training.
  • Figure 3: We observe that the g-SNR varies across different parameter blocks. However, for most weights, the parameter blocks that share the same structure across different transformer layers (blocks) tend to have similar g-SNR values. Additionally, the g-SNR values for the bias parameters are consistently low in magnitude. Our method can be viewed as partitioning all parameter blocks based on their structure.
  • Figure 4: We plot the g-SNR distribution over time for three different transformer blocks: shallow (block 0), middle (block 5), and deep (block 11). Additionally, we analyze some distinct types of parameter blocks. Our observations indicate that while the g-SNR values vary across different parameter blocks, they tend to remain relatively constant over time.
  • Figure 5: Comparison of top-1 test accuracy distributions for CNNs on CIFAR-10 (Left) and ViTs on ImageNet-1K (Right) across different hyperparameter combinations. The compared methods (Adam, AdamW, SGD, and SGD-SaI) show distinct performance trends. Adam-mini is compared only in the ViT case, since it is designed specifically for Transformer training. SGD-SaI consistently shows enhanced robustness and performance under varying hyperparameter settings.
  • ...and 10 more figures
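
The memory gap described in Figure 1 and the abstract follows from simple bookkeeping, sketched below under stated assumptions: AdamW keeps two full-precision state values per parameter, SGD with momentum keeps one, each value takes 4 bytes, and "GB" is read as 2^30 bytes. None of these details are confirmed by this summary, and the function names are hypothetical.

```python
GIB = 2 ** 30  # ASSUMPTION: the reported "GB" figures are binary gigabytes


def optimizer_state_bytes(n_params, states_per_param, bytes_per_value=4):
    """Bytes needed for optimizer state, assuming fp32 (4-byte) values."""
    return n_params * states_per_param * bytes_per_value


def savings_gib(n_params):
    """State memory saved by keeping one state per parameter instead of two."""
    adamw = optimizer_state_bytes(n_params, states_per_param=2)
    sgdm = optimizer_state_bytes(n_params, states_per_param=1)
    return (adamw - sgdm) / GIB


def implied_params(saved_gib, bytes_per_value=4):
    """Parameter count implied by a reported saving of one state per parameter."""
    return saved_gib * GIB / bytes_per_value


if __name__ == "__main__":
    # Reported savings from the abstract: 5.93 GB (GPT-2) and 25.15 GB (Llama2-7B).
    for name, saved in [("GPT-2", 5.93), ("Llama2-7B", 25.15)]:
        print(f"{name}: about {implied_params(saved) / 1e9:.2f}B parameters implied")
```

Under these assumptions the saving is exactly one state tensor, which is why the abstract describes the optimizer's memory usage as cut by half. The parameter counts implied by the reported figures land close to the nominal model sizes.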