Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models

Catherine Chen, Jack Merullo, Carsten Eickhoff

TL;DR

The paper addresses the opacity of neural ranking models by introducing a causal-intervention framework that combines axiomatic IR with mechanistic interpretability. It adapts activation patching to retrieval, which allows relevance criteria to be causally localized within a DistilBERT-based TAS-B encoder and the model's adherence to the $TFC1$ term-frequency axiom to be tested. The key finding is that four attention heads (0.9, 1.6, 2.3, 3.8) encode the term-frequency signal and drive ranking through interactions across layers, with the signal moving to the CLS token used for pooling in later layers. This work lays the groundwork for decomposing relevance into compositional components, which could inform safer deployment, model editing, and robustness against adversarial perturbations.
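
For reference, the $TFC1$ constraint mentioned above is usually stated in the axiomatic IR literature roughly as follows. The notation here is an assumption and not taken from this page: $f$ is the ranking function, $c(w, D)$ is the count of term $w$ in document $D$, and $|D|$ is document length.

```latex
% TFC1 (term-frequency constraint), stated for a single-term query.
% Symbols f, c, and |D| are assumed notation, not quoted from the paper.
\text{Let } q = \{w\} \text{ and } |D_1| = |D_2|. \quad
\text{If } c(w, D_1) > c(w, D_2), \text{ then } f(D_1, q) > f(D_2, q).
```

In words: between two documents of equal length, the one containing more occurrences of the query term should score higher.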

Abstract

Neural models have demonstrated remarkable performance across diverse ranking tasks. However, the processes and internal mechanisms along which they determine relevance are still largely unknown. Existing approaches for analyzing neural ranker behavior with respect to IR properties rely either on assessing overall model behavior or employing probing methods that may offer an incomplete understanding of causal mechanisms. To provide a more granular understanding of internal model decision-making processes, we propose the use of causal interventions to reverse engineer neural rankers, and demonstrate how mechanistic interpretability methods can be used to isolate components satisfying term-frequency axioms within a ranking model. We identify a group of attention heads that detect duplicate tokens in earlier layers of the model, then communicate with downstream heads to compute overall document relevance. More generally, we propose that this style of mechanistic analysis opens up avenues for reverse engineering the processes neural retrieval models use to compute relevance. This work aims to initiate granular interpretability efforts that will not only benefit retrieval model development and training, but ultimately ensure safer deployment of these models.

Paper Structure (33 sections, 10 figures, 1 table)

Figures (10)

  • Figure 1: Top: Traditional transformer diagram depicting a linear information flow between blocks. Bottom: Non-sequential transformer diagram demonstrating read and write operations to an assumed common residual stream. Layernorms are not shown for simplicity.
  • Figure 2: Activation patching setup for retrieval. In this example, a document pair is constructed to observe term frequency effects in the model. A perturbed document $X_{perturbed}$ (left) is created by injecting a selected query term ("Wellesley") at the end, and a baseline comparison document $X_{baseline}$ (right) is created by adding filler tokens to equalize document lengths. Each input is run through the model, with network components reading and writing information to their respective residual streams. In a third patched run, the model runs on $X_{baseline}$, and an activation (e.g., attention head output) from the cached $X_{perturbed}$ run is patched in. The model continues its run, with the residual stream now affected by the patch to produce a new ranking score.
  • Figure 3: Example of a perturbed and baseline document pair for TFC1-I, labeled by token types.
  • Figure 4: Results from patching into the residual stream at the start of each layer over all token positions in the document.
  • Figure 5: TFC1-I residual stream patching results by location of term injection. The left panel shows results from the original patching setup, with the selected query term injected at the end of each document. The right panel shows results with the selected query term injected at the beginning of each document.
  • ...and 5 more figures
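
The three-run procedure described in the Figure 2 caption (perturbed run, baseline run, patched run) can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the "heads" are random matrices that read from and write to a shared residual vector, `readout` stands in for the ranking score, and the normalized effect `(patched - baseline) / (perturbed - baseline)` is an assumed metric. All names (`run`, `patching_effects`, `W`, `readout`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS, N_HEADS = 8, 3, 2

# Hypothetical toy "heads": each reads the residual stream and writes an update.
W = rng.normal(scale=0.3, size=(N_LAYERS, N_HEADS, D_MODEL, D_MODEL))
readout = rng.normal(size=D_MODEL)  # stands in for the ranking score computation


def run(x, patch=None, cache=None):
    """Run the toy model on input vector x and return a scalar score.

    patch: optional ((layer, head), activation) overwriting one head's output.
    cache: optional dict that is filled with every head's output.
    """
    resid = x.copy()
    for layer in range(N_LAYERS):
        outputs = []
        for head in range(N_HEADS):
            out = np.tanh(W[layer, head] @ resid)
            if patch is not None and patch[0] == (layer, head):
                out = patch[1]  # swap in the activation from the other run
            if cache is not None:
                cache[(layer, head)] = out
            outputs.append(out)
        resid = resid + sum(outputs)  # all heads write to the shared stream
    return float(readout @ resid)


def patching_effects(x_baseline, x_perturbed):
    """Patch each head's perturbed-run output into the baseline run."""
    cache = {}
    score_perturbed = run(x_perturbed, cache=cache)  # run 1: cache activations
    score_baseline = run(x_baseline)                 # run 2: reference score
    gap = score_perturbed - score_baseline           # assumed nonzero
    effects = {}
    for key, activation in cache.items():
        patched = run(x_baseline, patch=(key, activation))  # run 3: patched
        effects[key] = (patched - score_baseline) / gap
    return effects


x_base = rng.normal(size=D_MODEL)
x_pert = x_base + rng.normal(size=D_MODEL)
for (layer, head), effect in patching_effects(x_base, x_pert).items():
    print(f"head {layer}.{head}: normalized effect {effect:+.3f}")
```

Under this assumed metric, a value near 1 would mean that patching a single component moves the baseline score most of the way to the perturbed score, while a value near 0 would mean the component carries little of the difference. In a real setting the inputs would be tokenized documents and the patch would target a specific head output at specific token positions.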