AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining

Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran

TL;DR

AdaLRS introduces an online learning rate (LR) search for foundation model pretraining, guided by the loss descent velocity. By jointly considering the training loss and its descent velocity as functions of the LR, and by updating the LR through multiplicative upscaling and downscaling, AdaLRS comes with convergence guarantees and often reaches a near-optimal learning rate in a single run. The theoretical analysis establishes that both the loss and its slope are convex in the LR and that the search error decays geometrically, while empirical results on LLM and VLM tasks demonstrate accelerated convergence and improved performance across diverse model sizes and schedulers. The work also provides ablations and continual pretraining experiments, highlighting both robustness and practical limitations in extremely large-LR regimes.

Abstract

Learning rate is widely regarded as crucial for effective foundation model pretraining. Recent research explores and demonstrates the transferability of learning rate configurations across varying model and dataset sizes, among other factors. Nevertheless, these approaches are constrained to specific training scenarios and typically necessitate extensive hyperparameter tuning on proxy models. In this work, we propose AdaLRS, a plug-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search by optimizing loss descent velocities. We provide theoretical and experimental analyses to show that the foundation model pretraining loss and its descent velocity are both convex and share the same optimal learning rate. Relying solely on training loss dynamics, AdaLRS involves few extra computations to guide the search process, and its convergence is guaranteed via theoretical analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts suboptimal learning rates to the neighborhood of the optimum with marked efficiency and effectiveness, with model performance improved accordingly. We also show the robust generalizability of AdaLRS across varying training scenarios, such as different model sizes, training paradigms, base learning rate scheduler choices, and hyperparameter settings.
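The idea of searching for a learning rate by multiplicative scaling guided by loss descent velocity can be illustrated with a toy sketch. The following is not the paper's algorithm: the function names, the scaling factors (2.0 and 0.5), the least-squares slope estimate, and the synthetic `toy_velocity` oracle are all assumptions made here for illustration. It only relies on the property stated above, namely that the descent velocity has a single optimum in the learning rate.

```python
import math
from typing import Callable, Sequence


def descent_velocity(losses: Sequence[float]) -> float:
    """Negative least-squares slope of loss against step index over a window.

    A larger value means the loss is falling faster. (Illustrative estimator,
    not taken from the paper.)
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two loss values")
    mean_x = (n - 1) / 2.0
    mean_y = sum(losses) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(losses))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return -cov / var


def toy_velocity(lr: float, lr_opt: float = 1e-3) -> float:
    """Hypothetical stand-in for a measured descent velocity.

    Peaks at lr_opt and falls off on both sides in log-space. In a real run
    this would be measured by training for a short window at `lr`.
    """
    return -(math.log10(lr / lr_opt)) ** 2


def search_lr(
    lr: float,
    velocity_at: Callable[[float], float],
    up: float = 2.0,
    down: float = 0.5,
    max_rounds: int = 50,
) -> float:
    """Greedy multiplicative search: accept a scaled LR only if velocity improves."""
    best = velocity_at(lr)
    for _ in range(max_rounds):
        for factor in (up, down):
            trial = lr * factor
            velocity = velocity_at(trial)
            if velocity > best:
                # The trial LR descends faster, so keep it and start a new round.
                lr, best = trial, velocity
                break
        else:
            # Neither upscaling nor downscaling helps: stop near the optimum.
            return lr
    return lr


if __name__ == "__main__":
    for start in (1e-5, 1e-1):
        print(f"start={start:g} -> found={search_lr(start, toy_velocity):g}")
```

With fixed factors of 2.0 and 0.5, this sketch can only land within a factor of two of the toy optimum, which mirrors the notion of converging to a neighborhood of the optimal learning rate rather than to the exact value. In an actual pretraining run, each call to `velocity_at` would cost a window of training steps, so the number of trial adjustments matters.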

Paper Structure

This paper contains 19 sections, 5 theorems, 13 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 2.1

The learning rate sequence $\{\eta_t\}$ generated by the algorithm converges almost surely to the $e$-neighborhood of the optimal learning rate $N_e(\eta^*) \triangleq \{\eta : |\eta - \eta^*| < e\}$, i.e.,

Figures (4)

  • Figure 1: Training loss and loss descent velocity dynamics w.r.t. varying LR settings for LLM and VLM pretraining. Figures (a)&(b)&(c) show the training losses and LR trajectories with a cosine LRS, while Figures (d)&(e)&(f) illustrate how training loss varies across different LR settings through the training process. Figures (g)&(h)&(i) are loss slope dynamics at varying loss levels, obtained from experiments with constant learning rates.
  • Figure 2: AdaLRS's learning rate adjustment process in foundation model pretraining under different LR settings. "Fit LR" refers to the learning rate appropriate for the pretraining task, as estimated from pilot study results. Dashed curves represent failed LR upscaling attempts.
  • Figure 3: AdaLRS's learning rate adjustment process in 2B LLM pretraining with the WSD scheduler. See Figure 2 for notation definitions.
  • Figure 4: Ablation studies for the backtracking downscaling strategy (a) and the training dynamics of AdaLRS on VLM continual pretraining (b)(c).

Theorems & Definitions (5)

  • Theorem 2.1
  • Proposition 2.2
  • Proposition 2.3
  • Theorem 2.4
  • Theorem B.1