Optimization Hyper-parameter Laws for Large Language Models
Xingyu Xie, Kuangyu Ding, Shuicheng Yan, Kim-Chuan Toh, Tianwen Wei
TL;DR
Opt-Laws introduce a principled framework, based on stochastic differential equations (SDEs), that connects training hyper-parameters, especially learning-rate (LR) schedules, with the final loss of large language models. By modeling SGD and Adam as time-inhomogeneous stochastic processes and fitting a 16-dimensional optimization-feature vector, Opt-Laws enable pre-selection of LR schedules, warmup steps, and peak LR across pre-training, continual training, and fine-tuning. The approach accounts for both convergence speed and escape from local minima within a single framework, provides divergence-prediction criteria, and generalizes to model-size effects, recovering classical scaling laws under appropriate limits. Empirically, Opt-Laws achieve high predictive accuracy (often below 0.1% relative error) on models with billions of parameters trained on hundreds of billions of tokens, and they reduce computational cost by allowing schedules to be ranked without extensive trial training. This combination of theoretical grounding and cross-stage applicability offers a practical tool for efficient, scalable hyper-parameter tuning in modern LLM workflows.
Abstract
Large Language Models (LLMs) have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that effectively captures the relationship between hyper-parameters and training outcomes, enabling the pre-selection of potentially optimal schedules. Grounded in stochastic differential equations, Opt-Laws introduce novel mathematical interpretability and offer a robust theoretical foundation for some popular LR schedules. Our extensive validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss and identify optimal LR schedule candidates in pre-training, continual training, and fine-tuning scenarios. This approach significantly reduces computational costs while enhancing overall model performance.
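To make the notion of a dynamic hyper-parameter concrete, the following is a minimal sketch of one widely used LR schedule, linear warmup followed by cosine decay, parameterized by the quantities the text singles out (warmup steps and peak LR). It is an illustration only and is not taken from the paper: the function name `warmup_cosine_lr`, its arguments, and the example values are all assumptions chosen for this sketch, and the Opt-Laws fitting procedure itself is not shown.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Hypothetical helper: LR at `step` for linear warmup then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example values are illustrative assumptions, not settings from the paper.
schedule = [
    warmup_cosine_lr(s, total_steps=1000, warmup_steps=100, peak_lr=3e-4)
    for s in range(1000)
]
```

A schedule of this kind is fully determined by a handful of numbers, which is what makes it possible in principle to compare candidate schedules before committing to a full training run.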
