When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models
Sunny Sanyal, Ravid Shwartz-Ziv, Alexandros G. Dimakis, Sujay Sanghavi
TL;DR
This work identifies attention degeneration in decoder-style LLMs: deeper layers frequently produce near rank-one attention patterns, forming lazy layers that reduce expressive capacity. It introduces Inheritune, a zero-shot initialization and progressive growth method that constructs smaller base LMs by inheriting early layers from a larger pre-trained model and then expanding the model while retraining. The authors define per-layer rank metrics, notably the approximate rank $k^*$ with threshold $\tau=0.90$, and show that lazy layers offer limited transferable knowledge. Across GPT-2 variants on OpenWebText and FineWeb_edu, Inheritune-trained models match or surpass larger models trained from scratch and outperform zero-shot initialization baselines and some distillation approaches. The work provides a practical, data-efficient approach to LM compression and enables efficient deployment of high-performing, compact language models, with code released for replication.
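To make the approximate rank $k^*$ concrete, here is a minimal NumPy sketch. It assumes a common definition that this summary does not spell out: $k^*$ is the smallest number of singular values whose squared sum reaches a fraction $\tau$ of the total. The function name `approximate_rank` and the toy matrices are illustrative and are not taken from the released code.

```python
import numpy as np

def approximate_rank(attn, tau=0.90):
    """Smallest k such that the top-k singular values hold at least a tau
    fraction of the total squared singular-value mass (assumed definition)."""
    s = np.linalg.svd(attn, compute_uv=False)  # singular values, descending
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:
        return 0
    cumulative = np.cumsum(energy) / total
    # Index of the first entry >= tau, converted to a count and clamped.
    k = int(np.searchsorted(cumulative, tau)) + 1
    return min(k, len(s))

n = 8
# Collapsed pattern: every query puts all of its weight on the first token.
collapsed = np.zeros((n, n))
collapsed[:, 0] = 1.0
# Spread-out pattern: every query attends only to its own position.
diagonal = np.eye(n)

print(approximate_rank(collapsed), approximate_rank(diagonal))
```

Under this assumed definition the single-column matrix has $k^* = 1$, which is the lazy-layer signature described above, while the diagonal matrix needs all 8 singular values to reach $\tau = 0.90$.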
Abstract
Large Language Models (LLMs) rely on the transformer architecture and its self-attention mechanism to deliver strong performance across tasks. However, we uncover a structural inefficiency in standard pre-trained decoder-style LLMs: in many of the deeper layers, attention matrices frequently collapse to near rank-one, single-column patterns. We refer to these underutilized components as lazy layers, which are redundant and computationally inefficient. To address this, we propose Inheritune, a simple and effective training recipe for building smaller, more efficient, and high-performing language models. Inheritune initializes a compact model by inheriting the useful early layers from a larger pre-trained model, then progressively retrains and expands it. Our experiments across multiple models and datasets show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts. This approach yields compact, performant models and offers a practical path for efficient language model compression. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune
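The inheritance step of the recipe can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the parameter naming scheme (`blocks.<i>.`), the helper name `inherit_early_layers`, and the choice to also copy non-block parameters such as embeddings are all hypothetical. The subsequent retraining and growth stages are not shown.

```python
import re

def inherit_early_layers(state, n_keep, pattern=r"^blocks\.(\d+)\."):
    """Return a copy of `state` holding only the first `n_keep` transformer
    blocks plus every parameter that does not belong to a block.

    `state` maps parameter names to values; the `blocks.<i>.` naming scheme
    is a hypothetical convention used only for this sketch.
    """
    layer_re = re.compile(pattern)
    small = {}
    for name, value in state.items():
        match = layer_re.match(name)
        # Keep non-block parameters and blocks with index below n_keep.
        if match is None or int(match.group(1)) < n_keep:
            small[name] = value
    return small

# Toy "large model" with 4 blocks; values stand in for weight tensors.
large = {"embed.weight": "E"}
for i in range(4):
    large[f"blocks.{i}.attn.weight"] = f"A{i}"
    large[f"blocks.{i}.mlp.weight"] = f"M{i}"

small = inherit_early_layers(large, n_keep=2)
print(sorted(small))
```

In a real framework the same filter would be applied to a model's parameter dictionary before loading it into a shallower architecture; the number of blocks to keep is a design choice that the rank analysis is meant to inform.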
