An Empirical Study of $μ$P Learning Rate Transfer

Lucas Lingle

TL;DR

This work empirically evaluates μ-Transfer, the width-based μP scaling approach, for transferring learning-rate optima from small proxy transformers to much larger models. Across more than a dozen ablations and a large-scale run with up to 10B parameters, learning-rate optima largely transfer, though failures arise with trainable RMSNorm gains and with the standard 1/√D attention scale. The study identifies two practical mitigations (nonparametric RMSNorm and a 1/D attention scale) and cautions that some weight decay configurations can hinder transfer. Overall, the results provide actionable guidelines for using μP to reduce hyperparameter search costs in large-scale transformer training, while highlighting the conditions that require adjustment for robust transfer.
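As a rough illustration of the two mitigations named above, the following NumPy sketch shows an RMSNorm with no trainable gain and attention logits computed under either scale. It is a minimal sketch rather than the paper's implementation: the function names, the `eps` value, and the reading of D as the per-head query/key dimension are assumptions made here for illustration.

```python
import numpy as np


def rmsnorm_nonparametric(x, eps=1e-6):
    """RMSNorm without a trainable gain (hypothetical helper).

    Divides each vector along the last axis by its root-mean-square.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms


def attention_logits(q, k, scale_rule="1/D"):
    """Pre-softmax attention logits for one head (hypothetical helper).

    q and k have shape (T, D). D is assumed to be the per-head dimension.
    """
    d = q.shape[-1]
    if scale_rule == "1/D":
        scale = 1.0 / d            # scale the TL;DR lists as a mitigation
    elif scale_rule == "1/sqrt(D)":
        scale = 1.0 / np.sqrt(d)   # standard transformer scale
    else:
        raise ValueError(f"unknown scale_rule: {scale_rule!r}")
    return (q @ k.T) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 64))
    y = rmsnorm_nonparametric(x)
    print(np.sqrt(np.mean(y * y, axis=-1)))  # each row has RMS close to 1

    q = rng.normal(size=(4, 64))
    k = rng.normal(size=(4, 64))
    print(attention_logits(q, k, "1/D").shape)
```

The two scale rules differ only by a factor of √D, so at a fixed D they are interchangeable up to a constant. The difference matters once D varies between the proxy and target models, which is the width-scaling setting the study examines.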

Abstract

Deep learning models have become a cornerstone of modern AI research, yet their initializations and learning rates may at times be set in an opaque or ad-hoc fashion due to the high cost of hyperparameter sweeps. The $μ$-Parameterization ($μ$P) offers a possible solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $μ$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work considers $μ$P empirically, focusing on the popular transformer architecture, and aims to answer a simple question: does $μ$-Transfer yield near-optimal learning rates in practice? Studying over a dozen ablations with up to 1.2B parameters and 33B tokens and a large-scale experiment with up to 10B parameters and 190B tokens, we observe a positive answer for most settings, and discuss improvements otherwise.
