Retrieval Heads are Dynamic

Yuping Lin, Zitao Li, Yue Xing, Pengfei He, Yingqian Cui, Yaliang Li, Bolin Ding, Jingren Zhou, Jiliang Tang

TL;DR

This work reframes retrieval heads in autoregressive LLMs as a dynamic, context-dependent component rather than a fixed subset. It reveals three core findings: retrieval heads vary across generation steps (dynamism), the specific dynamic heads at a given step are irreplaceable by static ones, and the model's final hidden state strongly encodes future retrieval patterns (correlation). The authors validate these claims on Needle-in-a-Haystack and HotpotQA and demonstrate practical benefits by integrating dynamic retrieval heads into a Dynamic RAG framework, achieving improved retrieval and reasoning performance. These results illuminate a planning-like mechanism in LLMs and suggest state-aware intervention strategies for more precise, context-sensitive information retrieval.

Abstract

Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the differences in utility between dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.

Paper Structure

This paper contains 44 sections, 4 equations, 20 figures, and 4 tables.

Figures (20)

  • Figure 1: Dynamism of Retrieval Heads. The retrieval scores of individual attention heads fluctuate across the generation process. Dark color indicates heads with a retrieval score of 1, as defined by Equation \ref{eq:retrieval_score_niah}. The x-axis denotes the generation step, labeled by the token generated at that step. The y-axis shows the 10 retrieval heads with the highest retrieval score variance over the entire generation process. L$x$-H$y$ denotes the $y$-th head (starting from 0) in layer $x$ (starting from 0).
  • Figure 2: Impact of Head Ablation on Retrieval Performance. Comparison of NIAH test scores on llama3.1-8b after masking three different sets of attention heads: dynamic retrieval heads, top-ranked static retrieval heads, and randomly selected heads. The x-axis shows different haystack lengths. The y-axis shows the different locations ("depth") where the needle is inserted. The evaluation metric is Accuracy (exact string match). The average number of masked heads is kept consistent across all conditions. Masking dynamic heads (identified at each timestep via Eq. \ref{eq:retrieval_score_niah}) results in the most significant performance degradation, indicating their critical role in retrieval.
  • Figure 3: Irreplaceability of Dynamic Retrieval Heads. The plots show the degradation in NIAH performance as an increasing number ($k$) of dynamic retrieval heads are masked on llama3.1-8b. Even though the model compensates by activating top-20 static retrieval heads (blue line, left y-axis), the overall retrieval performance, measured by Accuracy (red line, right y-axis), continues to decline sharply. This demonstrates that static retrieval heads cannot effectively substitute for context-specific dynamic heads.
  • Figure 4: Predictive Correlation Between Hidden States and Future Retrieval Scores. Canonical Correlation Analysis (CCA) coefficients between the final hidden state at timestep $n$ and the retrieval scores at a future timestep $n+k$. The plot shows the decay of the leading (Top-1) canonical correlation, as well as the average of the Top-10 and Top-50 correlations, as the temporal offset $k$ increases. The high correlation at $k>0$ demonstrates the model's anticipatory encoding of future retrieval intent.
  • Figure 5: Dynamic Pattern of Retrieval Heads in a Multi-Hop Reasoning Task. The heatmap illustrates the retrieval scores (defined in Eq. \ref{eq:retrieval_score_ratio}) for ten active retrieval heads over the course of the generation process.
  • ...and 15 more figures
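
The Figure 1 caption says the displayed heads are chosen by the variance of their retrieval scores over the generation process. A minimal sketch of that selection step is given below, under stated assumptions: the input layout (a dictionary from head label to per-step scores), the function name `top_variance_heads`, the toy values, and the use of population variance are all illustrative choices and are not taken from the paper.

```python
from statistics import pvariance


def top_variance_heads(scores, k=10):
    """Return the k head labels whose retrieval scores vary most across steps.

    scores: dict mapping a head label such as "L1-H2" to a list of per-step
    retrieval scores. This layout is hypothetical, chosen for illustration.
    """
    # Population variance per head; the caption does not say which
    # variance estimator is used, so this is an assumption.
    variances = {head: pvariance(vals) for head, vals in scores.items()}
    # Highest variance first; ties are broken by label for determinism.
    ranked = sorted(variances, key=lambda h: (-variances[h], h))
    return ranked[:k]


# Toy data: 0/1 retrieval scores for three heads over four generation steps.
toy_scores = {
    "L0-H0": [0, 0, 0, 0],  # never retrieves
    "L1-H2": [1, 0, 1, 0],  # alternates between steps
    "L2-H5": [1, 1, 1, 0],  # retrieves on most steps
}
selected = top_variance_heads(toy_scores, k=2)
print(selected)
```

On this toy input, a head that never retrieves has zero variance and is ranked last, which matches the caption's intent of surfacing the heads whose behavior changes most across steps.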