AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
TL;DR
AlphaDecay replaces the uniform weight decay commonly used in large language models with module-wise decay guided by Heavy-Tailed Self-Regularization (HT-SR) theory. It quantifies the heavy-tailedness of each module's spectrum with PL_Alpha_Hill, a power-law exponent obtained from the Hill estimator, and assigns each module a decay value by linear interpolation between two bounds, updating the assignment periodically during training. Across four model sizes (60M–1B) trained on C4, AlphaDecay consistently improves perplexity and generalization over Uniform, AWD, and AdaDecay, and its gains carry over to zero-shot and fine-tuning tasks and to other architectures and datasets. The method requires no architectural changes, and code is available at the authors’ repository.
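As a rough illustration of the heavy-tailedness measurement, the sketch below applies the standard Hill estimator to a list of eigenvalues. The function name `pl_alpha_hill`, the tail size `k`, and the toy eigenvalues are assumptions made for this example and are not taken from the authors' code; in practice the eigenvalues would come from a module's weight correlation matrix.

```python
import math

def pl_alpha_hill(eigenvalues, k):
    """Hill estimate of the power-law exponent of an eigenvalue spectrum.

    eigenvalues: positive eigenvalues of a weight correlation matrix.
    k: number of largest eigenvalues treated as the tail (assumed hyperparameter).
    """
    eigs = sorted(eigenvalues)          # ascending order
    n = len(eigs)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(eigenvalues)")
    threshold = eigs[n - k - 1]         # largest eigenvalue below the tail
    if threshold <= 0:
        raise ValueError("eigenvalues must be positive")
    tail = eigs[n - k:]                 # the k largest eigenvalues
    log_sum = sum(math.log(lam / threshold) for lam in tail)
    return 1.0 + k / log_sum

# Toy spectra (illustrative values only): the second has a heavier tail.
light = pl_alpha_hill([1.0, 2.0, 4.0, 8.0], k=2)
heavy = pl_alpha_hill([1.0, 2.0, 4.0, 80.0], k=2)
print(light, heavy)
```

A smaller exponent indicates a heavier-tailed spectrum, so `heavy` comes out lower than `light` here.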
Abstract
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training experiments with model sizes from 60M to 1B parameters demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. Our code is available at https://github.com/hed-ucas/AlphaDecay.
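The assignment rule described above (heavier tails get weaker decay, lighter tails get stronger decay) can be sketched as a linear map from each module's tail exponent to a decay value. This is a minimal sketch under stated assumptions: the function name `assign_weight_decay`, the module names, the base decay, and the scaling bounds `s_min` and `s_max` are hypothetical, and the exact formula in the authors' repository may differ.

```python
def assign_weight_decay(alphas, base_decay, s_min, s_max):
    """Map per-module tail exponents to per-module weight decay values.

    alphas: dict of module name -> tail exponent (smaller means heavier tail).
    base_decay: the uniform decay that would otherwise be used.
    s_min, s_max: assumed lower and upper scaling bounds on base_decay.
    """
    lo, hi = min(alphas.values()), max(alphas.values())
    span = hi - lo
    decays = {}
    for name, alpha in alphas.items():
        # Position of this module between the heaviest and lightest tails;
        # fall back to the midpoint when all exponents are equal.
        frac = (alpha - lo) / span if span > 0 else 0.5
        decays[name] = base_decay * (s_min + frac * (s_max - s_min))
    return decays

# Hypothetical exponents for three modules (illustrative values only).
example_alphas = {"attn.q_proj": 2.0, "attn.v_proj": 3.0, "mlp.up_proj": 4.0}
example_decays = assign_weight_decay(example_alphas, base_decay=0.1,
                                     s_min=0.5, s_max=1.5)
for name, wd in example_decays.items():
    print(f"{name}: weight_decay={wd:.3f}")
```

In this toy setup the module with the smallest exponent receives the weakest decay and the module with the largest exponent receives the strongest. In a training loop, such a mapping would presumably be recomputed at intervals and written into the optimizer's per-module settings.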
