Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning

Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu

TL;DR

The paper investigates how the depth of large language models contributes to knowledge, retrieval, and reasoning across diverse evaluation protocols and model architectures. It introduces a comprehensive layer-pruning framework and demonstrates that layer importance is highly task- and metric-dependent: knowledge and retrieval rely on shallow layers, while reasoning and generation depend on mid-to-deep layers. Distillation reshapes depth usage by redistributing reasoning capacity across layers and heads, often increasing robustness to pruning but not eliminating depth-specific dependencies. The findings underscore the need for task-, metric-, and model-aware evaluation to guide compression and design of future LLMs with reliable, interpretable depth utilization.

Abstract

Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers -- yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.
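
The pruning analysis described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that "pruning" means skipping one layer at a time while keeping the rest of the stack intact, and that the change in accuracy is the pruned accuracy minus the unablated accuracy. The toy layers, the dataset, and all function names (`run_model`, `accuracy`, `layer_ablation`) are hypothetical and exist only to make the procedure concrete.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for a transformer block: maps a hidden state to a hidden state.
Layer = Callable[[float], float]
Example = Tuple[float, bool]


def run_model(layers: Sequence[Layer], x: float, skip: Optional[int] = None) -> float:
    """Apply the layers in order; a skipped layer passes the state through unchanged."""
    h = x
    for i, layer in enumerate(layers):
        if i == skip:
            continue  # assumption: a pruned layer behaves like an identity map
        h = layer(h)
    return h


def accuracy(layers: Sequence[Layer], data: Sequence[Example],
             skip: Optional[int] = None) -> float:
    """Fraction of examples where the sign of the output matches the label."""
    correct = sum(1 for x, y in data if (run_model(layers, x, skip) > 0) == y)
    return correct / len(data)


def layer_ablation(layers: Sequence[Layer],
                   data: Sequence[Example]) -> Tuple[float, List[Tuple[int, float, float]]]:
    """Return the unablated accuracy and (layer index, mu, delta mu) for each layer."""
    base = accuracy(layers, data)
    results = []
    for i in range(len(layers)):
        mu = accuracy(layers, data, skip=i)
        results.append((i, mu, mu - base))  # assumption: delta mu = pruned - unablated
    return base, results


# Toy stack: layer 0 carries the decision, layers 1 and 2 never change the sign.
TOY_LAYERS: List[Layer] = [
    lambda h: h - 2.0,
    lambda h: h * 1.0,
    lambda h: h * 3.0,
]
# Toy task: the label is True when the input exceeds 2.
TOY_DATA: List[Example] = [(0.0, False), (1.0, False), (3.0, True), (5.0, True)]

if __name__ == "__main__":
    base, results = layer_ablation(TOY_LAYERS, TOY_DATA)
    print(f"unablated accuracy: {base:.2f}")
    for index, mu, delta in results:
        print(f"layer {index}: mu={mu:.2f}, delta_mu={delta:+.2f}")
```

In this toy setting only the first layer is critical, so removing it lowers accuracy while removing either of the others leaves it unchanged. With a real model, the same loop would be repeated for each evaluation protocol, since the abstract reports that the resulting importance profile differs between likelihood-based and generation-based metrics.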

Paper Structure

This paper contains 22 sections, 1 equation, 22 figures.

Figures (22)

  • Figure 1: Layer pruning results of LLaMA-3.1-8B on the MMLU benchmark under three evaluation protocols: log-likelihood default (left), log-likelihood continuation (middle), and generation until (right). We report accuracy ($\mu$) and relative change ($\Delta \mu$) across layers.
  • Figure 2: Layer pruning results of Qwen3-8B on the MMLU benchmark under the same three evaluation protocols. Accuracy ($\mu$) and relative change ($\Delta \mu$) are shown across layer indices.
  • Figure 3: Layer pruning results of LLaMA-3.1-8B on the HellaSwag dataset. We report both standard accuracy (acc) and cross-entropy–based accuracy (acc_ce), along with their relative differences compared to the unablated model across layers.
  • Figure 4: Layer pruning results of LLaMA-3.1-8B on the MathQA dataset.
  • Figure 5: Layer pruning results of LLaMA-3.1-8B on the KV Retrieval task. (a) Accuracy $\mu$ (blue), (b) $\Delta \mu$ (yellow).
  • ...and 17 more figures