The Curse of Depth in Large Language Models

Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu

TL;DR

The work identifies a Curse of Depth in large language models where deep Transformer layers contribute less due to variance explosion under Pre-Layer Normalization. It provides a theoretical and empirical analysis showing why deep blocks become nearly identity mappings and proposes LayerNorm Scaling, which applies a $1/\sqrt{\ell}$ factor to LayerNorm outputs to stabilize variance growth. Across model scales from 130M to 7B and architectures including ViT, LNS consistently improves pre-training perplexity, transfer to downstream tasks, and training stability, while reducing the risk of loss spikes. The approach is hyperparameter-free, easy to implement, and demonstrates robust gains in both language and vision transformers, suggesting broad practical impact for efficient, deeper neural architectures.

Abstract

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at https://github.com/lmsdss/LayerNorm-Scaling.
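The scaling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (which lives in the linked repository): it assumes a 1-based layer index, omits the learned gain and bias of LayerNorm, and uses hypothetical function names (`layer_norm`, `layer_norm_scaling`, `variance`) on plain Python lists.

```python
import math

def variance(x):
    """Population variance of a list of floats."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (nearly) unit variance.
    Assumption: no learned gain or bias, for brevity."""
    mean = sum(x) / len(x)
    inv_std = 1.0 / math.sqrt(variance(x) + eps)
    return [(v - mean) * inv_std for v in x]

def layer_norm_scaling(x, layer_index, eps=1e-5):
    """LayerNorm output multiplied by 1/sqrt(layer_index).
    Assumption: layer_index counts from 1 for the first block."""
    if layer_index < 1:
        raise ValueError("layer_index is assumed to be 1-based")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]

# Hypothetical hidden state, used only to show the effect on variance.
hidden = [0.5, -1.2, 3.0, 0.7, -2.1, 1.4, 0.0, -0.3]
for layer_index in (1, 4, 16):
    scaled = layer_norm_scaling(hidden, layer_index)
    print(layer_index, round(variance(scaled), 4))
```

Because multiplying a vector by a constant c multiplies its variance by c squared, the printed variance is close to 1 divided by the layer index, so deeper layers feed a smaller-variance signal into their sublayers.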

Paper Structure

This paper contains 36 sections, 7 theorems, 93 equations, 10 figures, 8 tables.

Key Result

Lemma 3.2

Let $\sigma^2_{x^\prime_\ell}$ and $\sigma^2_{x_\ell}$ denote the variances of $x^\prime_\ell$ and $x_\ell$, respectively. These two variances exhibit the same overall growth trend, and the growth of $\sigma^2_{x_\ell}$ is sub-exponential. The explicit growth expression and the corresponding bounds are stated in the paper and are not reproduced here.
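The kind of variance growth the lemma concerns can be illustrated with a toy simulation. The sketch below is not the paper's experiment and does not reproduce its bounds: it assumes a residual stream updated as x + W·LN(x), where a fresh random Gaussian matrix W stands in for each attention or feed-forward sublayer, and it compares plain Pre-LN against the same update with the LayerNorm output scaled by $1/\sqrt{\ell}$. All names and sizes (`residual_variances`, `random_linear`, 32 layers, width 128) are hypothetical.

```python
import math
import random

def variance(x):
    """Population variance of a list of floats."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def layer_norm(x, eps=1e-5):
    """Zero-mean, unit-variance normalization without learned parameters."""
    mean = sum(x) / len(x)
    inv_std = 1.0 / math.sqrt(variance(x) + eps)
    return [(v - mean) * inv_std for v in x]

def random_linear(x, rng):
    """Toy sublayer (assumption): a fresh Gaussian matrix with entry variance 1/d."""
    d = len(x)
    std = 1.0 / math.sqrt(d)
    return [sum(rng.gauss(0.0, std) * v for v in x) for _ in range(d)]

def residual_variances(num_layers, dim, use_lns, seed=0):
    """Track the residual-stream variance after each toy block."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    history = []
    for layer_index in range(1, num_layers + 1):
        normed = layer_norm(x)
        if use_lns:
            # Scale the normalized input by 1/sqrt(layer index).
            normed = [v / math.sqrt(layer_index) for v in normed]
        update = random_linear(normed, rng)
        x = [a + b for a, b in zip(x, update)]  # residual connection
        history.append(variance(x))
    return history

pre_ln = residual_variances(num_layers=32, dim=128, use_lns=False)
lns = residual_variances(num_layers=32, dim=128, use_lns=True)
for layer_index in (1, 8, 32):
    print(layer_index, round(pre_ln[layer_index - 1], 2), round(lns[layer_index - 1], 2))
```

Both runs share one seed, so they draw the same initial state and the same random matrices; the only difference is the scaling. In this toy setting each unscaled block adds roughly unit variance to the stream, while each scaled block adds roughly 1/ℓ, so the scaled stream's variance grows far more slowly with depth.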

Figures (10)

  • Figure 1: Left: Schematic diagrams of (a) Pre-LN and (b) LayerNorm Scaling. LayerNorm Scaling applies a scaling factor inversely proportional to the square root of the layer index $\ell$, preventing excessive variance growth. Right: Language modeling loss as the parameter count scales up to 7B. All models are trained for 20B tokens using OLMo (Groeneveld et al., 2024).
  • Figure 2: Results of open-weight large-scale LLMs. Top: Performance drop after removing a single layer without fine-tuning. Bottom: Angular distance between the initial layer $\ell$ (x-axis) and its subsequent $n^{\text{th}}$ layer (y-axis). The results demonstrate that in Pre-LN LLMs, deeper layers produce representations highly similar to those of their adjacent layers, and their removal results in minimal performance degradation. In contrast, Post-LN models show the opposite trend: deep layers contribute more substantially to model performance.
  • Figure 3: Results of in-house small-scale LLaMA-130M. Angular Distance (a, b): Each column represents the angular distance between the initial layer $\ell$ (x-axis) and its subsequent $n^{\text{th}}$ layer (y-axis). The distance is scaled to the range [0, 1], where yellow indicates smaller distances and purple indicates larger distances. Performance Drop (c, d): ARC-e performance drop from removing each single layer of LLaMA-130M.
  • Figure 4: Layerwise output variance. This figure compares the output variance across layers for three setups: (1) Pre-LN; (2) Pre-LN with Scaled Initialization (Shoeybi et al., 2019; Radford et al., 2019); and (3) LayerNorm Scaling. The experiments are conducted on the LLaMA-130M model trained for 10,000 steps. The proposed LayerNorm Scaling effectively controls the variance across layers.
  • Figure 5: Visualization of the LayerNorm Jacobian matrices across different layers of a pre-trained LLaMA2-7B model. Each heatmap shows the token-averaged Jacobian at a specific layer. As depth increases, the Jacobians exhibit a pronounced diagonal dominance with vanishing off-diagonal entries, indicating that deep LayerNorm blocks increasingly approximate identity mappings.
  • ...and 5 more figures
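The angular-distance measure used in Figures 2 and 3 can be sketched as follows. The captions only state that the distance is scaled to [0, 1]; the specific formula here (the angle between two hidden-state vectors divided by π) is an assumption, and the function names and the example states are hypothetical.

```python
import math

def angular_distance(u, v):
    """Angle between two vectors divided by pi, so the result lies in [0, 1].
    Assumption: this normalization matches the [0, 1] scaling in the captions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
    return math.acos(cosine) / math.pi

def block_distances(layer_states, n):
    """Distance between each layer's state and the state n layers later."""
    return [
        angular_distance(layer_states[i], layer_states[i + n])
        for i in range(len(layer_states) - n)
    ]

# Hypothetical per-layer hidden states for one token (four layers, width 3).
states = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.7, 0.7, 0.1],
    [0.7, 0.7, 0.12],
]
print([round(d, 3) for d in block_distances(states, 1)])
```

Small values between adjacent deep layers would correspond to the near-identical representations the captions describe; in practice the states would be averaged over many tokens rather than taken from a single vector.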

Theorems & Definitions (12)

  • Lemma 3.2
  • Theorem 3.3
  • Lemma 4.1
  • Theorem 4.2
  • Theorem 4.3
  • proof
  • Lemma A.1 (Ledoux, 2001)
  • Lemma A.2
  • proof
  • proof
  • ...and 2 more