The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

TL;DR

This work examines how attention heads drive in-context retrieval augmentation in large language models. Using an attribution method based on AttnLRP, it separates in-context heads, which process the prompt, from parametric heads, which store relational knowledge, and shows that the two groups play distinct functional roles. In-context heads further specialize into task heads and retrieval heads, and manipulating per-head function vectors or attention weights causally changes what the model generates. A probe on retrieval heads additionally makes it possible to trace efficiently whether an answer came from the context or from parametric memory, a step toward safer and more transparent retrieval-augmented LMs. The findings suggest practical ways to control knowledge sources and improve interpretability in RAG systems. The authors also note limitations and directions for future work.

Abstract

Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards more safe and transparent language models.
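As a rough illustration of how an attribution-based head selection could work, the sketch below ranks heads by a difference score and follows the sign convention stated in the Figure 3 caption (positive means in-context, negative means parametric). It is not the authors' implementation: the score definition (open-book relevance minus closed-book relevance), the function name `split_heads_by_score`, the cutoff `k`, and the random input arrays are all assumptions made for illustration. The 32 x 32 grid is chosen only to match the 1024 heads mentioned for Llama 3.1.

```python
import numpy as np

def split_heads_by_score(open_book, closed_book, k):
    """Rank heads by a difference score and return the k extremes on each side.

    open_book, closed_book: arrays of shape (n_layers, n_heads) holding one
    relevance value per head in each setting (hypothetical inputs).
    """
    # ASSUMPTION: the score is open-book relevance minus closed-book relevance.
    scores = np.asarray(open_book, dtype=float) - np.asarray(closed_book, dtype=float)
    n_heads = scores.shape[1]
    order = np.argsort(scores, axis=None)  # ascending, over the flattened grid

    def to_pair(flat_index):
        # Convert a flat index back into a (layer, head) pair.
        return (int(flat_index // n_heads), int(flat_index % n_heads))

    in_context = [to_pair(i) for i in order[::-1][:k]]  # most positive scores
    parametric = [to_pair(i) for i in order[:k]]        # most negative scores
    return scores, in_context, parametric

# Placeholder data only: random values stand in for real per-head relevance.
rng = np.random.default_rng(0)
open_rel = rng.random((32, 32))
closed_rel = rng.random((32, 32))
scores, in_context, parametric = split_heads_by_score(open_rel, closed_rel, k=5)
```

Each returned entry is a `(layer, head)` pair, so the two lists can be passed directly to whatever ablation or probing step comes next.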

Paper Structure

This paper contains 40 sections, 18 equations, 18 figures, 14 tables.

Figures (18)

  • Figure 1: Functional map of in-context and parametric heads in Llama-3.1-8B-Instruct. The heads are surprisingly well-structured and operate on the input prompt at various levels: in-context heads process information in the prompt, including instruction comprehension and retrieval operations, while parametric heads encode relational knowledge. In-context heads can specialize into task heads that parse instructions (blue) or retrieval heads that perform verbatim copying (green). Together with parametric heads, they affect the answer generation process through the function vectors they transport (a, c) or through their attention weights (b). Our relevance analysis (bar plot) shows that instruction-following capabilities emerge in middle layers, while answer retrieval occurs in later layers. Details in \ref{app:functional_map}.
  • Figure 2: Recall analysis for Llama 3.1 when either in-context or parametric heads are ablated. Removing the identified in-context heads noticeably degrades the model's performance in open-book QA across various configurations. Conversely, removing the identified parametric heads most strongly affects the model's closed-book QA capabilities. Compared to wu2025retrieval, which yields only AWR (retrieval) heads, our method identifies both in-context and parametric heads.
  • Figure 3: Sorted in-context scores for 1024 heads of Llama 3.1, comparing open-book and closed-book settings via score $\mathcal{D}$. Positive scores indicate in-context behavior, while negative scores reflect parametric behavior. Retrieval heads (green) and task heads (blue) are predominantly high-scoring in-context heads. See Appendix Figure \ref{app:fig:head_distribution} for other models.
  • Figure 4: Extraction and insertion of task and parametric FVs. The induced generation is highlighted in italic.
  • Figure 5: (a) When asked “Where does llama originate from?”, the retrieval-head probe classifies “South America” and “Africa” as parametric and “Meta” as contextual. The UMAP projection of retrieval head activations displays the linear probe’s decision boundary (dashed line) separating parametric from contextual clusters. (b) The weighted aggregation of retrieval head attention maps at the final query position is superimposed on the document to pinpoint the retrieved source span.
  • ...and 13 more figures
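
The head ablation described in the Figure 2 caption can be illustrated with a minimal sketch. The captions do not say how heads are removed, so zero-ablation (setting a head's output to zero before the output projection) is an assumption here, as are the function name `attention_output`, the tensor shapes, and the random placeholder data.

```python
import numpy as np

def attention_output(head_outputs, w_o, ablate=()):
    """Combine per-head outputs, zeroing any heads listed in `ablate`.

    head_outputs: (n_heads, seq_len, d_head) per-head attention results.
    w_o: (n_heads * d_head, d_model) output projection matrix.
    ablate: iterable of head indices to knock out (ASSUMED zero-ablation).
    """
    n_heads, seq_len, d_head = head_outputs.shape
    mask = np.ones(n_heads)
    for h in ablate:
        mask[h] = 0.0
    masked = head_outputs * mask[:, None, None]
    # Concatenate heads along the feature axis, then apply the projection.
    concat = masked.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
    return concat @ w_o

# Placeholder data only: 4 heads, 3 tokens, head size 2, model size 5.
rng = np.random.default_rng(0)
heads = rng.normal(size=(4, 3, 2))
w_o = rng.normal(size=(8, 5))
full = attention_output(heads, w_o)
without_head_1 = attention_output(heads, w_o, ablate=[1])
```

Because the projection is linear, `full - without_head_1` equals the contribution of head 1 alone. That makes it easy to check that the mask removes exactly the intended head.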