Optimization Hyper-parameter Laws for Large Language Models

Xingyu Xie; Kuangyu Ding; Shuicheng Yan; Kim-Chuan Toh; Tianwen Wei

TL;DR

Opt-Laws introduce a principled, SDE-based framework to connect training hyper-parameters, especially LR schedules, with final loss in large language models. By modeling SGD and Adam as time-inhomogeneous stochastic processes and fitting a 16-dimensional optimization-feature vector, Opt-Laws enable pre-selection of LR schedules, warmup steps, and peak LR across pre-training, continual training, and fine-tuning. The approach unifies convergence speed and escape from local minima, provides divergence-prediction criteria, and generalizes to model-size effects, recovering classical scaling laws under appropriate limits. Empirically, Opt-Laws achieve high predictive accuracy (often sub-0.1% relative error) on models with billions of parameters and hundreds of billions of tokens, while reducing computational costs by enabling schedule ranking without extensive trial training. This combination of theoretical grounding and cross-stage applicability offers a practical tool for efficient, scalable hyper-parameter tuning in modern LLM workflows.

Abstract

Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that effectively captures the relationship between hyper-parameters and training outcomes, enabling the pre-selection of potential optimal schedules. Grounded in stochastic differential equations, Opt-Laws introduce novel mathematical interpretability and offer a robust theoretical foundation for some popular LR schedules. Our extensive validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss and identify optimal LR schedule candidates in pre-training, continual training, and fine-tuning scenarios. This approach significantly reduces computational costs while enhancing overall model performance.

Paper Structure (50 sections, 11 theorems, 94 equations, 8 figures, 6 tables)

Introduction
Related Work
Scaling Laws
Convergence Analysis via Dynamical Systems
Escaping analysis via SDEs
Optimization Hyper-parameter Laws
Opt-Laws with Fixed Model Size
Opt-Laws Parameters Fitting
Understanding Training Phenomena through Opt-Laws
Influence of Warmup Steps on Training Loss
Insights into LR schedule Effects through Opt-Laws
Extension to the General Cases
Predicting Training Divergence
Generalized Opt-Laws
Discussion
...and 35 more sections

Key Result

Proposition 1

Let $a=r_a S$ and $a_c=r_{a_c}S$. For any $r_a>0$ and $r_{a_c}>0$ such that $0<r_a\leq r_{a_c}<1$, it holds that

Figures (8)

Figure 1: Contour plots of predicted perplexity, which is the exponential of the predicted training loss, versus warmup steps and peak LR for different token quantities (3B, 6B, 10B, 30B) from the RedPajama-v2 dataset.
Figure 2: Smoothed final training loss across various combinations of training parameters, including model sizes from $8 \times 0.001$B to $8 \times 0.3$B MoEs, peak LRs from 1e-3 to 1.5e-2, warmup steps from 128 to 6000, and data sizes of 10B and 30B tokens. Each grid point represents the loss for a specific parameter set. Divergent training runs were assigned a loss of 7, reflecting the typical plateau observed in practice.
Figure 3: Illustration of the criterion for predicting training divergence using a linear warmup and cooldown schedule. The areas $S_1$ (where the learning rate is below the threshold $\eta_L$) and $S_2$ (where it exceeds $\eta_L$) are compared. A ratio $S_1/S_2 > 1$ suggests stable training, while a ratio $< 1$ indicates likely divergence.
Figure 4: Illustration of a typical LR schedule comprising four phases: warmup, decay, plateau, and cooldown. This framework encompasses most LR schedules used in LLM training as special cases. We use this example to demonstrate the selection of the hyper-parameters $\mathbf{a}_c$ and $\mathbf{a}_e$ in Eq. \ref{eq:opt-law2}.
Figure 5: Comparison of actual training outcomes (left) and loss predictions generated by Eq. \ref{eq:opt-law2} (right) for a common LR schedule pattern with linear warmup and cooldown. In regions where the divergence indicator $R(\eta_{\max}, a_1, N, S)$ from Eq. \ref{eq:criterion} exceeds 1, the predicted loss is set to 7 to signify training failure. The average relative error between the predicted and actual losses is within $0.5\%$, demonstrating the accuracy of Eq. \ref{eq:opt-law2}.
...and 3 more figures
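
The area comparison described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the paper's criterion (Eq. \ref{eq:criterion} is not reproduced here). It assumes that "area" means the sum of the learning rate over the steps on each side of the threshold $\eta_L$, and that the cooldown decays linearly to zero. The function names, the example threshold, and the example step counts are all hypothetical.

```python
import math


def linear_warmup_cooldown(step, total_steps, warmup_steps, peak_lr):
    """LR at `step` for a linear warmup followed by a linear cooldown to 0.

    Assumption: the cooldown ends at exactly zero; the caption does not
    state the final learning rate.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


def schedule_areas(total_steps, warmup_steps, peak_lr, lr_threshold):
    """Return (S1, S2): summed LR at or below, and above, `lr_threshold`.

    Assumption: each area is the discrete sum of the LR over the steps that
    fall on that side of the threshold (a reading of the caption, not a
    definition taken from the paper).
    """
    s1 = 0.0
    s2 = 0.0
    for step in range(total_steps):
        lr = linear_warmup_cooldown(step, total_steps, warmup_steps, peak_lr)
        if lr > lr_threshold:
            s2 += lr
        else:
            s1 += lr
    return s1, s2


def area_ratio(total_steps, warmup_steps, peak_lr, lr_threshold):
    """Return S1 / S2, or infinity when the LR never exceeds the threshold."""
    s1, s2 = schedule_areas(total_steps, warmup_steps, peak_lr, lr_threshold)
    return math.inf if s2 == 0.0 else s1 / s2


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the functions.
    for peak in (1.2e-3, 2e-3, 4e-3):
        ratio = area_ratio(10_000, 1_000, peak, lr_threshold=1e-3)
        print(f"peak_lr={peak:.1e}  S1/S2={ratio:.3f}")
```

Under these assumptions, raising the peak LR at a fixed threshold shrinks the ratio, which matches the caption's reading that a ratio below 1 indicates likely divergence.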

Theorems & Definitions (11)

Proposition 1
Proposition 2: Trace Boundedness
Proposition 3: Covariance Boundedness
Theorem 1: SGD Convergence Bound
Proposition 4: Dynamics Boundedness
Theorem 2: Adam Convergence Bound
Proposition 5: SGD-SDE Approximation
Proposition 6: Adam-SDE Approximation
Theorem 3: SGD Escape Probability
Theorem 4: Adam Escape Probability
...and 1 more
