When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models

Sunny Sanyal, Ravid Shwartz-Ziv, Alexandros G. Dimakis, Sujay Sanghavi

TL;DR

This work identifies attention degeneration in decoder-style LLMs: deeper layers frequently produce near rank-1 attention patterns, forming lazy layers that reduce expressive capacity. It introduces Inheritune, a zero-shot initialization and progressive growth method that builds smaller base LMs by inheriting early layers from a larger pre-trained model, then expanding the model while retraining. The authors define per-layer rank metrics, notably the approximate rank $k^*$ with threshold $\tau=0.90$, and show that lazy layers offer limited transferable knowledge. Across GPT-2 variants on OpenWebText and FineWeb_edu, Inheritune-trained models match or surpass larger models trained from scratch and outperform zero-shot initialization baselines and some distillation approaches. The work provides a practical, data-efficient approach to LM compression that supports efficient deployment of compact, high-performing language models; code is released for replication.

Abstract

Large Language Models (LLMs) rely on the transformer architecture and its self-attention mechanism to deliver strong performance across tasks. However, we uncover a structural inefficiency in standard pre-trained decoder-style LLMs: in many of the deeper layers, attention matrices frequently collapse to near rank-one, single-column patterns. We refer to these underutilized components as lazy layers, which are redundant and computationally inefficient. To address this, we propose Inheritune, a simple and effective training recipe for building smaller, more efficient, and high-performing language models. Inheritune initializes a compact model by inheriting the useful early layers from a larger pre-trained model, then progressively retrains and expands it. Our experiments across multiple models and datasets show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts. This approach yields compact, performant models and offers a practical path for efficient language model compression. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune
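
The inherit-then-grow loop described in the abstract can be sketched roughly as follows. This is an illustrative outline, not the authors' released implementation: the key naming scheme `h.<index>.<param>`, the helper names (`block_index`, `inherit_blocks`, `init_target`, `inheritune`), the caller-supplied `train_and_eval` callback, and the stopping rule (stop once the target's validation loss is no worse than the reference's) are all assumptions made for the example. It also assumes that, when the model grows, already-retrained blocks are kept and only the newly added blocks are copied from the reference.

```python
import re

# Assumed (hypothetical) naming scheme for transformer-block weights.
BLOCK_KEY = re.compile(r"^h\.(\d+)\.")


def block_index(name):
    """Return the block index encoded in a weight name, or None for shared weights."""
    match = BLOCK_KEY.match(name)
    return int(match.group(1)) if match else None


def inherit_blocks(reference_state, first, last):
    """Copy the reference weights of blocks first..last-1."""
    picked = {}
    for name, weights in reference_state.items():
        index = block_index(name)
        if index is not None and first <= index < last:
            picked[name] = weights
    return picked


def init_target(reference_state, depth):
    """Build a target state: shared weights plus the first `depth` blocks."""
    shared = {n: w for n, w in reference_state.items() if block_index(n) is None}
    return {**shared, **inherit_blocks(reference_state, 0, depth)}


def inheritune(reference_state, reference_depth, reference_loss,
               start_depth, step, train_and_eval):
    """Train, then grow by inheriting more early blocks until the loss target is met.

    `train_and_eval(state)` is a caller-supplied (hypothetical) function that
    returns (trained_state, validation_loss).
    """
    depth = start_depth
    state = init_target(reference_state, depth)
    while True:
        state, val_loss = train_and_eval(state)
        if val_loss <= reference_loss or depth >= reference_depth:
            return depth, state, val_loss
        new_depth = min(depth + step, reference_depth)
        # Assumption: keep retrained blocks, add the next blocks from the reference.
        state = {**state, **inherit_blocks(reference_state, depth, new_depth)}
        depth = new_depth
```

In practice the state would be a framework state dict (for example, tensors keyed by parameter name) and `train_and_eval` would wrap a full training run; plain dictionaries are used here only to keep the sketch dependency-free.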

Paper Structure (52 sections, 3 equations, 23 figures, 11 tables, 1 algorithm)

Figures (23)

  • Figure 1: In decoder-style LLMs, attention matrices in deeper layers often degenerate to near rank-1, limiting their ability to learn meaningful representations. We compute $\text{MaxRank}^{(l)}$ (averaged over $N = 100$ randomly selected sequences each with $T=100$ tokens) for each layer $l$ using the OpenWebText validation set. Our rank analysis of 24-layer GPT-2 medium, 36-layer GPT-2 large, and 48-layer GPT-2 xlarge models reveals that attention matrices in many deeper layers collapse to near rank-1.
  • Figure 2: Higher-rank layers transfer better. (Left) Layer-wise $\mathrm{MaxRank}^{(l)}$ of a pre-trained 12L GPT-2 Small. (Right) Validation loss of 4L variants initialized with potent blocks (AvgRank $\approx 8.4$ to $9.5$) vs. a lazy block (AvgRank $\approx 1.2$) or random weights, after 100K steps. Lazy-block initialization performs about the same as random initialization.
  • Figure 3: Initializing 12-layer and 16-layer variants of GPT-2 medium and GPT-2 large with deeper (lazy) layers that show degenerate attention yields performance comparable to random initialization. In contrast, early-layer initialization leads to significantly better convergence and generalization.
  • Figure 4: Overview of the $\mathsf{Inheritune}$ training recipe using a 24-layer GPT-2 medium model as an example. A smaller target model is initialized using early layers from a larger, pre-trained reference model. The target model goes through multiple rounds of training while inheriting more early layers until it matches the reference model. The intensity of the red color in layers correlates with $\mathrm{MaxRank}^{(l)}$.
  • Figure 5: Models derived using $\mathsf{Inheritune}$ converge faster and match the final validation loss of the full-sized model, despite having far fewer layers. Comparison of $\mathsf{Inheritune}$-trained models (24-layer GPT-2 xLarge variant, 18-layer GPT-2 Large variant, 16-layer GPT-2 Medium variant) against their full-sized counterparts and same-sized variants trained from scratch. All models are trained for 100K steps using OpenWebText data with data repetition.
  • ...and 18 more figures
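
The rank metrics referenced in the captions above can be illustrated with a small NumPy sketch. It rests on assumptions that are not spelled out in this summary: the approximate rank $k^*$ is taken to be the smallest $k$ whose top-$k$ singular values hold at least a fraction $\tau$ of the squared singular-value mass, MaxRank is taken as the maximum of $k^*$ over the heads of a layer, and AvgRank as the mean over heads. The function names `approximate_rank` and `layer_rank_stats` are hypothetical, and the exact normalization used in the paper may differ.

```python
import numpy as np


def approximate_rank(attn, tau=0.90):
    """Smallest k such that the top-k squared singular values reach a tau fraction.

    `attn` is a single (T, T) attention matrix. The energy-based definition
    used here is an assumption made for this sketch.
    """
    singular_values = np.linalg.svd(attn, compute_uv=False)  # sorted descending
    energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    k = int(np.searchsorted(energy, tau)) + 1  # first index with energy >= tau
    return min(k, len(singular_values))


def layer_rank_stats(attn_heads, tau=0.90):
    """Return (MaxRank, AvgRank) for one layer and one input sequence.

    `attn_heads` has shape (n_heads, T, T). Taking the max and the mean over
    heads is an assumed reading of the MaxRank and AvgRank metrics.
    """
    ranks = [approximate_rank(head, tau) for head in attn_heads]
    return max(ranks), float(np.mean(ranks))


# A collapsed head: every token attends only to the first position.
collapsed = np.zeros((8, 8))
collapsed[:, 0] = 1.0

# A maximally spread-out spectrum, for contrast: the identity matrix.
diverse = np.eye(8)

max_rank, avg_rank = layer_rank_stats(np.stack([collapsed, diverse]))
```

To reproduce a per-layer profile like the one in Figure 1, these per-sequence values would then be averaged over the sampled sequences (the caption reports $N = 100$ sequences of $T = 100$ tokens each).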