Optimization Hyper-parameter Laws for Large Language Models

Xingyu Xie, Kuangyu Ding, Shuicheng Yan, Kim-Chuan Toh, Tianwen Wei

TL;DR

Opt-Laws introduce a principled, SDE-based framework to connect training hyper-parameters, especially LR schedules, with final loss in large language models. By modeling SGD and Adam as time-inhomogeneous stochastic processes and fitting a 16-dimensional optimization-feature vector, Opt-Laws enable pre-selection of LR schedules, warmup steps, and peak LR across pre-training, continual training, and fine-tuning. The approach unifies convergence speed and escape from local minima, provides divergence-prediction criteria, and generalizes to model-size effects, recovering classical scaling laws under appropriate limits. Empirically, Opt-Laws achieve high predictive accuracy (often sub-0.1% relative error) on models with billions of parameters and hundreds of billions of tokens, while reducing computational costs by enabling schedule ranking without extensive trial training. This combination of theoretical grounding and cross-stage applicability offers a practical tool for efficient, scalable hyper-parameter tuning in modern LLM workflows.

Abstract

Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that effectively captures the relationship between hyper-parameters and training outcomes, enabling the pre-selection of potential optimal schedules. Grounded in stochastic differential equations, Opt-Laws introduce novel mathematical interpretability and offer a robust theoretical foundation for some popular LR schedules. Our extensive validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss and identify optimal LR schedule candidates in pre-training, continual training, and fine-tuning scenarios. This approach significantly reduces computational costs while enhancing overall model performance.

Paper Structure

This paper contains 50 sections, 11 theorems, 94 equations, 8 figures, and 6 tables.

Key Result

Proposition 1

Let $a=r_a S$ and $a_c=r_{a_c}S$. For any $r_a>0$ and $r_{a_c}>0$ such that $0<r_a\leq r_{a_c}<1$, it holds that

Figures (8)

  • Figure 1: Contour plots of predicted perplexity, which is the exponential of the predicted training loss, versus warmup steps and peak LR for different token quantities (3B, 6B, 10B, 30B) from the RedPajama-v2 dataset.
  • Figure 2: Smoothed final training loss across various combinations of training parameters, including model sizes from $8 \times 0.001$B to $8 \times 0.3$B MoEs, peak LRs from 1e-3 to 1.5e-2, warmup steps from 128 to 6000, and data sizes of 10B and 30B tokens. Each grid point represents the loss for a specific parameter set. Divergent training runs were assigned a loss of 7, reflecting the typical plateau observed in practice.
  • Figure 3: Illustration of the criterion for predicting training divergence using a linear warmup and cooldown schedule. The areas $S_1$ (where the learning rate is below the threshold $\eta_L$) and $S_2$ (where it exceeds $\eta_L$) are compared. A ratio $S_1/S_2 > 1$ suggests stable training, while a ratio $< 1$ indicates likely divergence.
  • Figure 4: Illustration of a typical LR schedule comprising four phases: warmup, decay, plateau, and cooldown. This framework encompasses most LR schedules used in LLM training as special cases. We use this example to demonstrate the selection of the hyper-parameters $\mathbf{a}_c$ and $\mathbf{a}_e$ in Eqn. \ref{eq:opt-law2}.
  • Figure 5: Comparison of actual training outcomes (left) and loss predictions generated by Eqn. \ref{eq:opt-law2} (right) for a common LR schedule pattern with linear warmup and cooldown. In regions where the divergence indicator $R(\eta_{\max}, a_1, N, S)$ from Eqn. \ref{eq:criterion} exceeds 1, the predicted loss is set to 7 to signify training failure. The average relative error between the predicted and actual losses is within $0.5\%$, demonstrating the accuracy of Eqn. \ref{eq:opt-law2}.
  • ...and 3 more figures
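
The area comparison described in the Figure 3 caption can be sketched numerically. This is a minimal illustration, not the paper's implementation. The function names `linear_warmup_cooldown` and `area_ratio`, the example threshold value for $\eta_L$, and the reading of $S_1$ and $S_2$ as the areas under the LR curve over the steps where the LR is below or at-or-above $\eta_L$ are all assumptions made here; the paper's exact definitions are not reproduced in this summary.

```python
import numpy as np


def linear_warmup_cooldown(total_steps, warmup_steps, peak_lr, final_lr=0.0):
    """Hypothetical helper: LR per step for a linear warmup followed by a linear cooldown."""
    steps = np.arange(total_steps, dtype=float)
    # Warmup: ramp linearly so that step warmup_steps - 1 reaches peak_lr.
    warm = peak_lr * (steps + 1) / warmup_steps
    # Cooldown: interpolate linearly from peak_lr down to final_lr at the last step.
    cool_frac = (steps - warmup_steps + 1) / max(total_steps - warmup_steps, 1)
    cool = peak_lr + (final_lr - peak_lr) * cool_frac
    return np.where(steps < warmup_steps, warm, cool)


def area_ratio(lrs, eta_L):
    """Assumed reading of Figure 3: S1 / S2 with unit-width steps.

    S1 sums the LR over steps where it is below eta_L,
    S2 sums the LR over steps where it is at or above eta_L.
    """
    below = lrs < eta_L
    s1 = lrs[below].sum()
    s2 = lrs[~below].sum()
    # If the schedule never reaches the threshold, S2 is empty.
    return s1 / s2 if s2 > 0 else float("inf")


if __name__ == "__main__":
    eta_L = 5e-3  # illustrative threshold, not a value from the paper
    for peak in (6e-3, 1e-2):
        lrs = linear_warmup_cooldown(total_steps=1000, warmup_steps=100, peak_lr=peak)
        ratio = area_ratio(lrs, eta_L)
        verdict = "stable" if ratio > 1 else "likely divergence"
        print(f"peak_lr={peak:g}  S1/S2={ratio:.3f}  -> {verdict}")
```

Under this reading, raising the peak LR while holding $\eta_L$ fixed enlarges $S_2$ relative to $S_1$, so the ratio falls and the schedule moves toward the regime the caption flags as likely to diverge.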

Theorems & Definitions (11)

  • Proposition 1
  • Proposition 2: Trace Boundedness
  • Proposition 3: Covariance Boundedness
  • Theorem 1: SGD Convergence Bound
  • Proposition 4: Dynamics Boundedness
  • Theorem 2: Adam Convergence Bound
  • Proposition 5: SGD-SDE Approximation
  • Proposition 6: Adam-SDE Approximation
  • Theorem 3: SGD Escape Probability
  • Theorem 4: Adam Escape Probability
  • ...and 1 more