Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models

Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li

TL;DR

Training stability for large language models remains challenging due to gradient explosion and dissipation, especially in Post-Norm Transformer architectures. The authors propose Scale-Distribution Decoupling (SDD), reformulating fully-connected layers as $y = \alpha \odot \mathrm{norm}(V x)$ to separately regulate scale with a learnable vector $\alpha$ while normalization controls activation dispersion. Theoretical analysis shows approximate expressiveness equivalence to standard layers and improved gradient conditioning, and extensive experiments on dense and MoE models show faster convergence and better downstream performance with only negligible overhead. Overall, SDD provides a lightweight, practical solution that enhances stability, scalability, and generalization in large-scale language model training.

Abstract

Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.
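
As a rough illustration of the reformulated layer $y = \alpha \odot \mathrm{norm}(V x)$, the following is a minimal pure-Python sketch, not the authors' implementation. The exact form of $\mathrm{norm}$ is not specified in this summary, so the sketch assumes an RMS-style normalization; the helper names `rms_norm` and `sdd_layer`, the `eps` constant, and the toy values are all hypothetical.

```python
import math

def rms_norm(z, eps=1e-6):
    """Assumed normalization: divide the vector by its root-mean-square."""
    rms = math.sqrt(sum(v * v for v in z) / len(z) + eps)
    return [v / rms for v in z]

def sdd_layer(x, V, alpha, eps=1e-6):
    """Sketch of y = alpha * norm(V x), with alpha applied elementwise."""
    # Matrix-vector product V x (V is a list of rows).
    z = [sum(w * xi for w, xi in zip(row, x)) for row in V]
    # The normalization fixes the dispersion of the activations.
    z_hat = rms_norm(z, eps)
    # The learnable vector alpha alone sets the output scale.
    return [a * v for a, v in zip(alpha, z_hat)]

# Toy example: 3 inputs, 4 outputs (values are arbitrary).
x = [1.0, -2.0, 0.5]
V = [
    [0.2, -0.1, 0.4],
    [0.7, 0.3, -0.5],
    [-0.6, 0.8, 0.1],
    [0.05, 0.9, -0.3],
]
alpha = [1.0, 1.0, 1.0, 1.0]

y = sdd_layer(x, V, alpha)

# Rescaling V by a large constant leaves the output essentially unchanged,
# because the normalization removes the scale of V x.
V_scaled = [[100.0 * w for w in row] for row in V]
y_scaled = sdd_layer(x, V_scaled, alpha)

# Doubling alpha doubles the output: scale is controlled by alpha only.
y_double = sdd_layer(x, V, [2.0 * a for a in alpha])
```

Under these assumptions, the magnitude of the weights in `V` has almost no effect on the output (up to `eps`), while `alpha` sets the output scale directly, which is the decoupling the abstract describes.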

Paper Structure

This paper contains 18 sections, 27 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Training/validation loss with downstream performance on HellaSwag and PIQA for 1B dense models trained with 2T tokens: SDD-1B (Post-Norm) achieves superior convergence ($1.5\times$) and generalization over OLMo2-1B (Pre-Norm).
  • Figure 2: Comparison of vanilla and SDD-based Self-Attention/FFN architectures. The top-left figure shows the standard self-attention module, while the top-right presents the self-attention module with SDD. Similarly, the middle figure depicts the standard feed-forward network (FFN), and the bottom shows the SDD-based FFN. In these figures, "FC" represents a fully-connected layer, and "SDD" denotes the SDD-based fully-connected layer, formulated as $y = \alpha \odot \mathrm{norm}(V x)$. Labels beneath "FC" and "SDD" indicate their learnable parameters. Notably, the additional parameter $\alpha$ in "SDD" is a one-dimensional vector, contributing negligible overhead.
  • Figure 3: Training and validation loss on C4 for dense models trained with 200 billion tokens. A comparison of OLMo2-1B (Pre-Norm), DeepNorm-1B (Post-Norm), PostNorm-1B (Post-Norm), and SDD-1B (Post-Norm) highlights the superior convergence and stability of SDD-1B.
  • Figure 4: Downstream performance on MMLU, HellaSwag, ARC-Challenge, and OpenbookQA for dense models trained on 200B tokens. SDD-1B consistently outperforms others, showcasing superior generalization.
  • Figure 5: Training and Validation Loss on C4 for MoE Models with 250 Billion Tokens: Comparison of OLMoE-588M-3B (Pre-Norm) and SDD-588M-3B (Post-Norm).
  • ...and 7 more figures

Theorems & Definitions (2)

  • proof
  • proof