On Mechanistic Circuits for Extractive Question-Answering
Samyadeep Basu, Vlad Morariu, Zichao Wang, Ryan Rossi, Cherry Zhao, Soheil Feizi, Varun Manjunatha
TL;DR
The paper studies mechanistic circuits inside large language models to understand extractive QA, distinguishing answers that rely on retrieved context from answers that rely on parametric memory. It introduces a framework based on causal mediation analysis (CMA) that extracts two circuits, Context-Faithfulness and Memory-Faithfulness, from multiple models, and finds that a small set of attention heads drives context attribution. Building on this, it presents ATTNATTRIB, a single-head attribution method that achieves strong data attribution across extractive QA benchmarks and can be used to steer models toward context-faithful answers during the forward pass. The work demonstrates practical applications for grounding and reliability in context-augmented QA and shows that the approach generalizes to larger models, offering a blueprint for applying mechanistic insights in real-world deployments.
Abstract
Large language models are increasingly used to process documents and facilitate question-answering on them. In our paper, we extract mechanistic circuits for this real-world task, context-augmented language modeling for extractive question-answering (QA), and study the potential benefits of circuits for downstream applications such as data attribution to context information. We extract circuits as a function of internal model components (e.g., attention heads, MLPs) using causal mediation analysis techniques. Leveraging the extracted circuits, we first examine the interplay between the model's use of parametric memory and retrieved context, toward a better mechanistic understanding of context-augmented language models. We then identify a small set of attention heads in our circuit that performs reliable data attribution by default, thereby obtaining attribution for free in the model's forward pass. Using this insight, we then introduce ATTNATTRIB, a fast data attribution algorithm that obtains state-of-the-art attribution results across various extractive QA benchmarks. Finally, we show that the language model can be steered toward answering from the context, instead of from parametric memory, by using the attribution from ATTNATTRIB as an additional signal during the forward pass. Beyond mechanistic understanding, our paper provides tangible applications of circuits in the form of reliable data attribution and model steering.
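To make the idea of attribution "for free" from an attention head more concrete, the following is a minimal sketch and not the authors' ATTNATTRIB implementation. It assumes we already have the attention weights of one chosen head, from each generated answer token to each context token, and that the context has been split into token spans (for example, one per sentence). The function name `attribute_to_span`, the input layout, and the scoring rule (mean over answer tokens, then summed per span) are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def attribute_to_span(
    attn: List[List[float]],
    spans: List[Tuple[int, int]],
) -> int:
    """Return the index of the context span that receives the most attention.

    attn[i][j] is an assumed attention weight from generated answer token i
    to context token j, read from a single attention head.
    spans are half-open (start, end) context-token ranges, e.g. sentences.
    """
    if not attn or not spans:
        raise ValueError("need at least one answer token and one span")
    n_ctx = len(attn[0])
    # Average over answer tokens to get one score per context token.
    token_scores = [sum(row[j] for row in attn) / len(attn) for j in range(n_ctx)]
    # Score each span by the total attention mass falling inside it.
    span_scores = [sum(token_scores[start:end]) for start, end in spans]
    # Pick the span with the highest score (first one wins on ties).
    return max(range(len(spans)), key=span_scores.__getitem__)


# Toy example: two answer tokens, six context tokens, two spans.
attn = [
    [0.05, 0.05, 0.10, 0.40, 0.30, 0.10],
    [0.10, 0.05, 0.05, 0.30, 0.45, 0.05],
]
spans = [(0, 3), (3, 6)]
best = attribute_to_span(attn, spans)
print(best)
```

Summing attention mass per span favors longer spans; dividing by span length would be a reasonable alternative. In a real model the `attn` matrix would come from the attention outputs of the forward pass, which is why this kind of attribution adds no extra passes.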
