AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin

TL;DR

AlphaDecay addresses the suboptimality of uniform weight decay in large language models by employing module-wise regularization guided by HT-SR theory. It quantifies each module's spectral heavy-tailedness with a Hill-estimator-derived PL_Alpha_Hill exponent and assigns per-module decay through a linear interpolation between two bounds, updated periodically. Across four model sizes (60M–1B) trained on C4, AlphaDecay consistently improves perplexity and generalization versus Uniform, AWD, and AdaDecay and transfers gains to zero-shot and finetuning tasks, as well as cross-architecture datasets. This HT-SR–driven, module-aware regularization framework offers a principled, practical path to improve transformer training without architectural changes, with code available at the authors’ repository.
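The per-module assignment described above can be sketched as a linear interpolation of a decay multiplier between two bounds. This is a minimal sketch, not the authors' implementation: the function name `assign_weight_decay`, the bounds `s_low` and `s_high`, the midpoint fallback for identical values, and the example PL_Alpha_Hill values are all assumptions made for illustration.

```python
from typing import Dict


def assign_weight_decay(
    alphas: Dict[str, float],
    base_decay: float,
    s_low: float = 0.5,   # hypothetical lower bound on the decay multiplier
    s_high: float = 1.5,  # hypothetical upper bound on the decay multiplier
) -> Dict[str, float]:
    """Map per-module PL_Alpha_Hill values to per-module weight decay.

    Lower alpha (heavier tail) gets weaker decay; higher alpha gets stronger decay.
    """
    lo, hi = min(alphas.values()), max(alphas.values())
    if hi == lo:
        # All modules look equally heavy-tailed: fall back to the midpoint
        # of the two bounds (an assumption, not taken from the paper).
        mid = 0.5 * (s_low + s_high)
        return {name: base_decay * mid for name in alphas}
    return {
        name: base_decay * (s_low + (a - lo) / (hi - lo) * (s_high - s_low))
        for name, a in alphas.items()
    }


# Illustrative values only; these are not measurements from the paper.
alphas = {"att.q": 2.0, "att.k": 2.5, "mlp.up": 4.0}
decay = assign_weight_decay(alphas, base_decay=0.1)
```

Since the summary says the assignment is updated periodically, a training loop would presumably recompute `alphas` and call a function like this every fixed number of steps; the update interval is not stated here.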

Abstract

Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. Our code is available at https://github.com/hed-ucas/AlphaDecay.
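To make the notion of "heavy-tailedness" concrete, the sketch below computes a Hill-type power-law exponent from the eigenvalues of a weight correlation matrix (the squared singular values of the weight matrix). It uses a common form of the Hill estimator, alpha = 1 + k / sum_{i=1..k} ln(lambda_{n-i+1} / lambda_{n-k}), with eigenvalues sorted in ascending order. The paper's exact formula and its choice of `k` are not reproduced in this summary, so the function name `pl_alpha_hill`, the parameter `k`, and the example spectra are assumptions.

```python
import math
from typing import Sequence


def pl_alpha_hill(eigenvalues: Sequence[float], k: int) -> float:
    """Hill-type estimate of the power-law exponent of a spectrum's tail.

    eigenvalues: eigenvalues of the weight correlation matrix (all positive).
    k: number of largest eigenvalues treated as the tail (assumed hyperparameter).
    """
    lam = sorted(eigenvalues)  # ascending order
    n = len(lam)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of eigenvalues")
    threshold = lam[n - k - 1]  # lambda_{n-k} in 1-indexed notation
    if threshold <= 0:
        raise ValueError("eigenvalues must be positive")
    # Sum of log-ratios of the k largest eigenvalues to the threshold.
    log_sum = sum(math.log(lam[n - i] / threshold) for i in range(1, k + 1))
    if log_sum == 0:
        raise ValueError("tail eigenvalues are all equal to the threshold")
    return 1.0 + k / log_sum


# Illustrative spectra only: a wider spread in the tail yields a smaller alpha.
light_tail = pl_alpha_hill([1.0, 2.0, 4.0], k=2)
heavy_tail = pl_alpha_hill([1.0, 10.0, 100.0], k=2)
```

Under this estimator a smaller exponent indicates a heavier tail, which matches the abstract's rule that modules with more pronounced heavy-tailed ESDs receive weaker decay.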

Paper Structure

This paper contains 28 sections, 9 equations, 19 figures, 13 tables, and 1 algorithm.

Figures (19)

  • Figure 1: Module-wise Balance and AlphaDecay weight decay schedule. (a) Employing PL fitting to derive module-wise PL_Alpha_Hill values (see the PL_Alpha_Hill formula), AlphaDecay achieves module-wise balance by increasing the lower values (e.g., att.Q and att.K, more heavy-tailed) while decreasing the higher values (e.g., MLP components, less heavy-tailed). (b) Given the imbalanced module-wise PL_Alpha_Hill of LLaMa-60M, AlphaDecay assigns lower weight decay to modules with lower PL_Alpha_Hill.
  • Figure 2: Visualization of singular values from weight matrices in each layer of the pretrained LLaMa-2-13b-hf model (https://huggingface.co/meta-LLaMa/LLaMa-2-13b-hf). For all 40 transformer layers, the plots show the sorted distribution of 5120 singular values per layer.
  • Figure 3: Comparison of ESDs across modules of LLaMa-135M under different training methods (AlphaDecay: Perplexity=22.55 vs. Uniform: Perplexity=23.14). Attention-related modules (e.g., att.q, att.k) exhibit notably heavier spectral tails than MLP-associated modules. Our method systematically balances the heavy-tailed properties across modules by appropriately configuring module-wise weight decay, thereby enhancing overall model performance.
  • Figure 4: Comparison of perplexity and module-wise PL_Alpha_Hill values of LLaMa-135M under varying weight decay settings. For each group, att.q/k shows the mean PL_Alpha_Hill of att.q and att.k; att.v/o shows the mean for att.v and att.o; mlp is the mean of mlp.gate, mlp.up, and mlp.down. Shaded areas indicate the range between the maximum and minimum values within each group.
  • Figure 5: Weight Decay = 1e-5
  • ...and 14 more figures
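
The grouping described in the Figure 4 caption (a mean per group, with the shaded band spanning the minimum and maximum) can be sketched as follows. The group membership follows the caption; the function name `summarize_groups` and the example PL_Alpha_Hill values are assumptions for illustration, not data from the paper.

```python
from statistics import mean
from typing import Dict, Tuple

# Group membership as described in the Figure 4 caption.
GROUPS = {
    "att.q/k": ("att.q", "att.k"),
    "att.v/o": ("att.v", "att.o"),
    "mlp": ("mlp.gate", "mlp.up", "mlp.down"),
}


def summarize_groups(alphas: Dict[str, float]) -> Dict[str, Tuple[float, float, float]]:
    """Return (mean, min, max) of PL_Alpha_Hill for each module group."""
    summary = {}
    for group, members in GROUPS.items():
        values = [alphas[m] for m in members]
        summary[group] = (mean(values), min(values), max(values))
    return summary


# Illustrative values only; these are not measurements from the paper.
example_alphas = {
    "att.q": 2.0, "att.k": 3.0,
    "att.v": 3.5, "att.o": 4.5,
    "mlp.gate": 4.0, "mlp.up": 5.0, "mlp.down": 6.0,
}
summary = summarize_groups(example_alphas)
```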