The Unreasonable Ineffectiveness of the Deeper Layers

Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts

TL;DR

The paper shows that deep-layer pruning in open-weight LLMs can preserve knowledge-intensive QA performance with minimal degradation, using a similarity-based criterion to remove blocks of layers and a small QLoRA-based healing step. By analyzing angular distances between layer representations, it demonstrates that deeper layers tend to be redundant for storing knowledge, though they matter for reasoning tasks and longer generation. Healing the pruned interface eliminates sharp losses in next-token prediction and yields continuous performance across pruning fractions, revealing a miscalibration between QA metrics and autoregressive loss. The findings challenge assumptions about where knowledge resides in LLMs and point to practical compression strategies that retain QA capabilities while enabling substantial parameter reduction.

Abstract

How is knowledge stored in an LLM's weights? We study this via layer pruning: if removing a certain layer does not affect model performance in common question-answering benchmarks, then the weights in that layer are not necessary for storing the knowledge needed to answer those questions. To find these unnecessary parameters, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. Surprisingly, with this method we find minimal degradation of performance until after a large fraction (up to half) of the layers are removed for some common open-weight models. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge. For our study, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single 40GB A100 GPU.

Paper Structure (24 sections, 6 equations, 11 figures)

Figures (11)

  • Figure 1: Overview of our layer-pruning strategy and example results: (a) a flowchart describing the algorithm: if removing $n$ layers, we find the layer, $\ell^*$, that minimizes the angular distance, $d$, between layers $\ell$ and $\ell + n$; we then remove the $n$ layers beginning with layer $\ell^*$; finally, if necessary, we can "heal" the damage with a small amount of (parameter-efficient) finetuning. (b) a schematic depicting the removal of $n$ total layers, indexed from $\ell^*$ to $\ell^* + n - 1$. (c) angular distance, $d$, for different block sizes, $n$, vs. the layer number, $\ell$, that indexes the beginning of the block of $n$; the bottom curve (darkest purple) represents $n=1$, while the top curve (lightest yellow) represents $n=64$; the black line traces $\ell^*(n)$, the minimum of the angular distance across the different sized layer blocks. (d) results of pruning Llama-2-70B with healing (light blue) and without healing (dark blue) as a function of the fraction of layers removed: the top (middle) panel gives the accuracy on the MMLU (BoolQ) question-answering benchmark, while the bottom panel gives the autoregressive loss on a subset of the C4 validation set; here, the dashed red lines (dashed gray lines) indicate the accuracy or loss of the original unpruned model (of random guessing); these plots illustrate the typical behavior we find, in which there are sharp transitions in the accuracy of question-answering tasks (here between 40%-50% pruning fraction), but continuity and very slow growth in the healed loss (light blue) up to at least 80% pruning fraction.
  • Figure 2: MMLU accuracy (5-shot) vs. fraction of layers dropped for different model families. (Left: Llama-2 family; Middle: Qwen family; Right: Mistral-7B and Phi-2.) The solid lines represent performance after dropping layers and healing, dotted lines show performance after dropping layers only (no healing), and the dashed gray line is the score for guessing randomly. For these models, healing leads to modest improvements, and performance is quite robust until 20%-55% pruning fractions, depending on model family and size, at which point it transitions to random guessing.
  • Figure 3: Normalized C4 validation loss vs. fraction of layers dropped before healing (left) and after healing (right); each curve is normalized by the cross-entropy loss of sampling uniformly from the model's vocabulary. For the experiments before healing, the loss for each model transitions to random guessing (gray dashed line) at approximately the same pruning fractions at which the QA benchmarks transition to random guessing; after healing, there is continuity through the regions of sharp transition on QA tasks, cf. Figure 2. Contrasting the overall scale of both plots, it's clear that healing significantly restores the performance on next-token prediction to near-unpruned levels.
  • Figure 4: Normalized angular distance (defined in the paper's arccos-similarity equation) from initial layer $\ell$ (x-axis) with block size $n$ (y-axis) for each of the seven models we evaluated; the distance for each $n$ is shifted and rescaled to span the same range, $[0,1]$ (yellow to purple): the optimal block to prune, $\ell^*(n)$, corresponds to the deepest yellow for each row. Across models, the deeper layers tend to be very similar, though the deepest blocks that include the final layer (squares along the outer diagonal) are (near-)maximally dissimilar.
  • Figure 5: Evaluation of Llama-2-70B with the simple pruning heuristic (solid red line), shown along with scores for the similarity-informed pruning strategy (solid blue line), scores of the unpruned Llama-2-70B (red dashed line), and scores for randomly guessing (gray dashed line). (Left: before healing, Right: after healing; Top: MMLU, Middle: BoolQ, Bottom: C4 Validation Loss.) Without healing, the simple heuristic performs poorly across all evals; with healing, the scores of both methods are quite similar.
  • ...and 6 more figures
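
The block-selection step described in the Figure 1 caption, and the loss normalization described in the Figure 3 caption, can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It assumes the angular distance is the arccosine of the cosine similarity divided by $\pi$ (suggested by the "arccos" equation label and the word "normalized", but not spelled out in the captions), that layers are 0-indexed, and that `hidden[l]` holds the representation fed into layer `l` for a batch of samples. All function and variable names are hypothetical.

```python
import numpy as np


def angular_distance(a, b):
    """Angular distance in [0, 1] between matching rows of a and b.

    Assumed form: arccos(cosine similarity) / pi.
    """
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi


def best_block_start(hidden, n):
    """Find the start l* of the n-layer block whose removal changes least.

    hidden: array of shape (num_layers + 1, num_samples, dim), where
    hidden[l] is the input to layer l and hidden[num_layers] is the
    output of the last layer (an assumed layout).
    Returns (l_star, dists), where dists[l] is the sample-averaged
    distance between hidden[l] and hidden[l + n].
    """
    num_layers = hidden.shape[0] - 1
    if not 1 <= n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")
    dists = np.array([
        angular_distance(hidden[l], hidden[l + n]).mean()
        for l in range(num_layers - n + 1)
    ])
    return int(np.argmin(dists)), dists


def prune_layers(layers, l_star, n):
    """Drop layers l_star .. l_star + n - 1 and keep the rest in order."""
    return layers[:l_star] + layers[l_star + n:]


def normalized_loss(loss_nats, vocab_size):
    """Divide a cross-entropy loss (in nats) by the loss of sampling
    uniformly from the vocabulary, which is log(vocab_size)."""
    return loss_nats / np.log(vocab_size)
```

With per-layer representations collected from a forward pass over some evaluation text, `best_block_start(hidden, n)` returns the candidate $\ell^*$ for a given block size, and `prune_layers` gives the reduced layer list that would then be healed with finetuning. A normalized loss of 1.0 corresponds to the random-guessing line in Figure 3.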