Quantifying reliance on external information over parametric knowledge during Retrieval Augmented Generation (RAG) using mechanistic analysis
Reshmi Ghosh, Rahul Seetharaman, Hitesh Wadhwa, Somyaa Aggarwal, Samyadeep Basu, Soundararajan Srinivasan, Wenlong Zhao, Shreyas Chaudhari, Ehsan Aghazadeh
TL;DR
The paper investigates how Retrieval Augmented Generation (RAG) shifts a language model's reliance from parametric memory to non-parametric retrieved context. It introduces three mechanistic probes to quantify information flow from the retrieved context and the subject token to the final prediction: Causal Tracing, with indirect effect $IE(h^{(\ell)}_i) = P^*_{clean}(h^{(\ell)}_i)[y] - P^*[y]$ and average indirect effect $AIE = \mathbb{E}_{prompt}[IE(h^{(\ell)}_i)]$; Attention Contributions, measured via $||a^{(\ell)}_{i,T}||$; and Attention Knockouts. Across LLaMa-2 and Phi-2 on the Knowns 1000 dataset with GPT-4-generated context, the results reveal a pronounced 'shortcut' bias toward retrieved information and a substantial reduction in the influence of parametric memory, evidenced by a lower AIE at the last subject token (LST) and weakened information flow from the subject token to the last token (ST→LT). This work highlights the impact of RAG on factual reasoning and offers guidance for designing robust retrieval-augmented systems, while outlining directions for scaling to larger models and longer contexts.
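The indirect-effect formulas above can be illustrated with a minimal Python sketch. It assumes the usual causal-tracing reading, which this summary does not spell out: $P^*[y]$ is the probability of the correct answer $y$ in a corrupted run, and $P^*_{clean}(h^{(\ell)}_i)[y]$ is the same probability after one hidden state is restored to its clean value. The function names and the probability values are hypothetical and are not results from the paper.

```python
from statistics import mean


def indirect_effect(p_restored: float, p_corrupted: float) -> float:
    """IE for one prompt and one hidden state.

    p_restored  : P*_clean(h)[y], answer probability after restoring h (assumed reading)
    p_corrupted : P*[y], answer probability in the corrupted run (assumed reading)
    """
    return p_restored - p_corrupted


def average_indirect_effect(records) -> float:
    """AIE: the mean IE over prompts for a fixed hidden-state position."""
    return mean(indirect_effect(r, c) for r, c in records)


# Hypothetical (p_restored, p_corrupted) pairs, one per prompt.
records = [(0.62, 0.10), (0.45, 0.05), (0.30, 0.20)]
aie = average_indirect_effect(records)
print(f"AIE = {aie:.2f}")
```

A larger AIE at a given position means that restoring that hidden state recovers more of the correct prediction, so a drop in AIE at the subject token under RAG would be consistent with reduced use of parametric memory.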
Abstract
Retrieval Augmented Generation (RAG) is a widely used approach for leveraging external context in several natural language applications such as question answering and information retrieval. Yet exactly how a Language Model (LM) leverages this non-parametric memory, or retrieved context, is not clearly understood. This paper mechanistically examines the RAG pipeline to highlight that LMs demonstrate a "shortcut" effect and have a strong bias towards utilizing the retrieved context to answer questions, while relying minimally on model priors. We propose (a) Causal Mediation Analysis to show that parametric memory is minimally utilized when answering a question, and (b) Attention Contributions and Knockouts to show that the last-token residual stream is not enriched by the subject token in the question but is enriched by tokens of the RAG context. We find this pronounced "shortcut" behaviour to be true across both LLMs (e.g., LLaMa) and SLMs (e.g., Phi).
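To make the knockout idea concrete, here is a toy sketch of blocking attention from the last token to a chosen set of source positions. It is an illustration under simplifying assumptions, not the authors' implementation: a single attention head, made-up pre-softmax scores, and a hypothetical helper named `last_token_attention`. Blocking is done by setting the selected scores to negative infinity before the softmax, so the remaining positions absorb the attention mass.

```python
import math


def softmax(scores):
    """Numerically stable softmax that treats -inf entries as zero weight."""
    finite = [s for s in scores if s != -math.inf]
    m = max(finite)  # assumes at least one position is left unblocked
    exps = [0.0 if s == -math.inf else math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def last_token_attention(scores, blocked=()):
    """Attention weights of the last token over source positions.

    scores  : pre-softmax attention scores from the last token (toy values)
    blocked : source positions to knock out
    """
    masked = [-math.inf if i in blocked else s for i, s in enumerate(scores)]
    return softmax(masked)


# Toy layout (an assumption): positions 0-1 are retrieved context,
# position 2 is the subject token, position 3 is another question token.
scores = [2.0, 1.0, 0.5, 0.0]
baseline = last_token_attention(scores)
no_context = last_token_attention(scores, blocked={0, 1})
print(baseline)
print(no_context)
```

In a real model the comparison of interest is the change in the answer probability after the knockout, which requires rerunning the forward pass with the masked attention; this sketch only shows the masking step.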
