How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective

Qi Liu, Jiaxin Mao, Ji-Rong Wen

TL;DR

This work probes how large language models internalize relevance signals for information retrieval using activation patching within a mechanistic interpretability framework. It uncovers a progressive information flow where early layers extract document and query content, middle layers integrate task instructions, and late layers deploy specific attention heads to format the final relevance judgments. The study identifies a sparse set of attention heads that mediate query–document interactions and demonstrates that the same mechanism operates across pointwise and pairwise prompts and across multiple models and datasets. These findings provide a principled view of LLM-based IR and offer guidance for designing more interpretable and reliable relevance assessment systems.

Abstract

Recent studies have shown that large language models (LLMs) can assess relevance and support information retrieval (IR) tasks such as document ranking and relevance judgment generation. However, the internal mechanisms by which off-the-shelf LLMs understand and operationalize relevance remain largely unexplored. In this paper, we systematically investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability. Using activation patching techniques, we analyze the roles of various model components and identify a multi-stage, progressive process in generating either pointwise or pairwise relevance judgment. Specifically, LLMs first extract query and document information in the early layers, then process relevance information according to instructions in the middle layers, and finally utilize specific attention heads in the later layers to generate relevance judgments in the required format. Our findings provide insights into the mechanisms underlying relevance assessment in LLMs, offering valuable implications for future research on leveraging LLMs for IR tasks.
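
As a rough illustration of activation patching, the sketch below runs a toy model three times (clean, corrupted, patched) and measures how much restoring one module's clean activation moves the output. Everything in it is a hypothetical stand-in: `run_model`, the module names `branch_a` and `branch_b`, the scalar inputs, and the choice of "change in output probability" as the effect metric are assumptions for illustration, not the paper's models or its exact metric.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def run_model(x, patch=None):
    """Toy model with two parallel modules (hypothetical, for illustration).

    `patch` maps a module name to an activation that overrides the value
    the module would otherwise compute on this input.
    """
    patch = patch or {}
    acts = {}
    # branch_a depends on the input (stands in for a module that reads the document)
    acts["branch_a"] = patch.get("branch_a", 3.0 * x)
    # branch_b ignores the input (stands in for a module irrelevant to the task)
    acts["branch_b"] = patch.get("branch_b", 1.0)
    # output: probability assigned to the "relevant" answer
    acts["prob"] = sigmoid(acts["branch_a"] + acts["branch_b"])
    return acts


clean_input = 1.0     # stands in for the prompt with the positive document
corrupt_input = -1.0  # stands in for the prompt with the negative document

clean = run_model(clean_input)      # run 1: clean
corrupt = run_model(corrupt_input)  # run 2: corrupted


def patching_effect(module):
    """Run 3: corrupted input, but `module` takes its clean-run activation."""
    patched = run_model(corrupt_input, patch={module: clean[module]})
    # Assumed metric: how far the patched output moves away from the corrupted one.
    return patched["prob"] - corrupt["prob"]


for name in ("branch_a", "branch_b"):
    print(name, round(patching_effect(name), 4))
```

In this toy setup, patching `branch_a` restores the clean output entirely, while patching `branch_b` changes nothing because its activation is identical in both runs. With a real LLM, the same override would typically be applied to a layer or attention-head output at a chosen token position.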

Paper Structure

This paper contains 27 sections, 8 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Visualization of an activation patching example using a pointwise-style prompt (shown in (a)). Activation patching computes the effect of a specific module by running the LLM three times: a clean run (b) with the positive document, a corrupted run with the negative document, and a patched run that takes the corrupted input but replaces the activation of the selected module with its value from the clean run. The effect is then computed from the patched output. The potential information flow within LLMs is shown in (b): the LLM first captures document and query information in the early layers (green modules), then receives task information in the middle layers (blue modules), and finally controls result generation in the deeper layers (purple modules).
  • Figure 2: Illustration of Causal Mediation Analysis.
  • Figure 3: Indirect effect of different model components at different token positions within Llama-3.1-8B-Instruct on 100 samples from MS MARCO.
  • Figure 4: Rank-Biased Overlap ($p = 0.7$) of pointwise and pairwise prompts at different components and token positions.
  • Figure 5: Indirect effect of individual heads. (a) Heads' output at the last token. (b) Heads' output at the position of the query. (c) Heads' attention scores at the position of query-document. Several heads with the highest effects are highlighted in pink.
  • ...and 3 more figures
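
For readers unfamiliar with the Rank-Biased Overlap measure named in Figure 4, the following is a minimal sketch of its truncated form for two finite rankings, using $p = 0.7$ as in the caption. The function name `rbo_truncated` and the example component labels are hypothetical, and the source does not say whether the paper uses this truncated sum or an extrapolated variant, so treat it as an assumption.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.7):
    """Truncated Rank-Biased Overlap of two rankings (items unique per list).

    Computes (1 - p) * sum_{d=1..k} p^(d-1) * A_d, where A_d is the fraction
    of items shared by the two top-d prefixes and k is the shorter list length.
    Smaller p puts more weight on agreement at the top ranks.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # overlap of the top-d prefixes
        score += p ** (d - 1) * agreement
    return (1 - p) * score


# Hypothetical example: components ranked by effect under two prompt styles.
pointwise = ["head_3", "head_7", "head_1", "head_9"]
pairwise = ["head_3", "head_1", "head_7", "head_2"]
print(rbo_truncated(pointwise, pairwise))
```

Because the sum stops at the list length, identical finite rankings score 1 - p^k rather than 1, so this value is best read as a lower bound on the full measure.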