On Mechanistic Circuits for Extractive Question-Answering

Samyadeep Basu, Vlad Morariu, Zichao Wang, Ryan Rossi, Cherry Zhao, Soheil Feizi, Varun Manjunatha

TL;DR

The paper investigates mechanistic circuits inside large language models to understand extractive QA, distinguishing when answers rely on retrieved context versus parametric memory. It introduces a framework based on causal mediation analysis (CMA) to extract two circuits, Context-Faithfulness and Memory-Faithfulness, from multiple models, and finds that a small set of attention heads drives context attribution. Building on this, it presents AttnAttrib, a single-head attribution method that achieves strong data attribution across extractive QA benchmarks and can be used to steer models toward context-faithful answering within a forward pass. The work demonstrates practical applications for grounding and reliability in context-augmented QA and reports that the approach generalizes to larger models.

Abstract

Large language models are increasingly used to process documents and facilitate question-answering on them. In our paper, we extract mechanistic circuits for this real-world language modeling task, context-augmented language modeling for extractive question-answering (QA), and study the potential benefits of circuits for downstream applications such as data attribution to context information. We extract circuits as a function of internal model components (e.g., attention heads, MLPs) using causal mediation analysis techniques. Leveraging the extracted circuits, we first examine the interplay between the model's usage of parametric memory and retrieved context, towards a better mechanistic understanding of context-augmented language models. We then identify a small set of attention heads in our circuit that perform reliable data attribution by default, thereby obtaining attribution for free in just the model's forward pass. Using this insight, we introduce ATTNATTRIB, a fast data attribution algorithm which obtains state-of-the-art attribution results across various extractive QA benchmarks. Finally, we show that it is possible to steer the language model towards answering from the context instead of the parametric memory, by using the attribution from ATTNATTRIB as an additional signal during the forward pass. Beyond mechanistic understanding, our paper provides tangible applications of circuits in the form of reliable data attribution and model steering.
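To make the idea of attribution "for free" from one attention head concrete, the following is a minimal sketch, not the ATTNATTRIB algorithm itself (its details are not given in this summary). It assumes we already have one head's attention weights from the answer token to each context token, and that the context has been split into token spans such as sentences. The function name `attribute_to_span`, the summed-attention score, and the toy numbers are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def attribute_to_span(
    attn_weights: Sequence[float],
    spans: Sequence[Tuple[int, int]],
) -> Tuple[int, List[float]]:
    """Pick the context span that receives the most attention mass.

    attn_weights: attention from the answer token to each context token,
                  taken from ONE attention head (assumed to be given).
    spans: half-open (start, end) token ranges, e.g. one per sentence.
    Returns the index of the best span and the per-span scores.
    """
    if not spans:
        raise ValueError("need at least one span")
    # Score each span by its total attention mass (an assumed aggregation).
    scores = [sum(attn_weights[start:end]) for start, end in spans]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: eight context tokens grouped into three hypothetical sentences.
attn = [0.02, 0.03, 0.05, 0.40, 0.35, 0.05, 0.04, 0.06]
spans = [(0, 3), (3, 5), (5, 8)]
best, scores = attribute_to_span(attn, spans)
print(best, scores)
```

In this toy input the second span holds most of the attention mass, so it is returned as the attributed source. Summing favors longer spans; averaging per token is an equally plausible choice under these assumptions.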

Paper Structure

This paper contains 58 sections, 1 equation, 19 figures, 1 algorithm.

Figures (19)

  • Figure 1: Obtaining Circuits for Extractive QA in Language Models. We use our probe dataset along with path patching to extract circuits corresponding to (i) Context and (ii) Memory Faithfulness. We find that a small set of attention heads from the circuit can be used to perform data attribution in one forward pass and also to steer language models towards context faithfulness. In this figure, we show one step of the patching operation and expand on it in Sec. \ref{patching_illustration}.
  • Figure 2: (i) Top Row (Context Circuit Components). We find that a small set of attention layers and attention heads is sufficient for a high average metric score across all the models. However, we find that for Vicuna and Phi-3, patching MLPs does not lead to a high metric score. For Llama-3-8B, we find MLP-31 to have a high direct effect, which, when greedily combined with other MLP layers, obtains higher scores; (ii) Bottom Row (Memory Circuit Components). We find that a large number of attention heads and layers are required to obtain a high metric score. Unlike the context circuit, we find MLPs to be important for the memory circuit.
  • Figure 3: We find that one attention head in the context faithfulness circuit obtains a low entropy value in the context window. Qualitative results show that this attention head for Vicuna leads to peaky attention values in the context span containing the answer, whereas other attention heads produce either diffuse or erroneous attention. Further results on Llama-3 and Phi-3 are in the Appendix.
  • Figure 4: Ablating the extracted context-faithfulness circuit leads to a large drop in extractive QA accuracy for various datasets. We ablate the edges from the extracted circuit and a random circuit in the language model and measure the extractive QA accuracy.
  • Figure 5: Attribution through one attention head in our circuit via AttnAttrib obtains strong attribution results. Across various extractive QA benchmarks, we obtain improved performance over different attribution baselines. For HotPotQA, we measure the F1-score due to it being multi-hop, whereas for other datasets, we measure the attribution accuracy. We present further results on long-form generations in Sec. \ref{long_answer_generations} and attribution results on other synthetic datasets in Sec. \ref{full_results}.
  • ...and 14 more figures
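
The Figure 3 caption contrasts a low-entropy ("peaky") attention head with diffuse ones. As a rough illustration of that measurement, the sketch below computes the Shannon entropy of a head's attention weights restricted to the context window. The function name `attention_entropy`, the renormalization over the window, and the example weights are assumptions for illustration, not values or code from the paper.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of attention weights over the context window.

    The weights are renormalized to sum to 1 over the window first, since
    attention restricted to the context tokens need not sum to 1 by itself.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have positive total mass")
    probs = [w / total for w in weights]
    # Zero-probability terms contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical heads: one concentrated on a single token, one uniform.
peaky_head = [0.01, 0.01, 0.95, 0.02, 0.01]
diffuse_head = [0.2, 0.2, 0.2, 0.2, 0.2]

print(attention_entropy(peaky_head) < attention_entropy(diffuse_head))
```

A uniform head over n tokens attains the maximum entropy log(n), so ranking heads by this value in ascending order is one simple way to surface candidates with concentrated attention.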