From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries

Hitesh Wadhwa, Rahul Seetharaman, Somyaa Aggarwal, Reshmi Ghosh, Samyadeep Basu, Soundararajan Srinivasan, Wenlong Zhao, Shreyas Chaudhari, Ehsan Aghazadeh

TL;DR

Retrieval Augmented Generation (RAG) is widely used to supplement language models with external context, but how that context competes with parametric memory remains unclear. This study mechanistically probes LLaMa-2 and Phi-2 on 1209 factual queries, comparing vanilla and RAG-equipped prompts using causal tracing, attention contributions, and attention knockouts. The results show that parametric memory is minimally utilized when retrieved context is present: the last-token residual stream derives most of its signal from context tokens rather than from the subject token, the subject token's attention contribution drops, and context-driven attention edges dominate. These findings suggest a pronounced shortcut behavior in which external knowledge supersedes internal knowledge, with implications for RAG design and reliability, and for future work on longer contexts and instruction-tuned models.

Abstract

Retrieval Augmented Generation (RAG) enriches the ability of language models to reason by using external context to augment responses to a given user prompt. This approach has risen in popularity due to practical applications of language models in search, question answering, and chatbots. However, exactly how this approach works is not clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take a shortcut and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis to show that the parametric memory is minimally utilized when answering a question and (ii) Attention Contributions and Knockouts to show that the last-token residual stream does not get enriched from the subject token in the question, but gets enriched from other informative tokens in the context. We find that this pronounced shortcut behaviour holds across both the LLaMa and Phi families of models.
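The indirect effect behind the Causal Mediation Analysis in (i) can be sketched as follows. This is a minimal illustration that assumes the usual causal-tracing definition (probability of the correct answer after restoring one clean hidden state in a corrupted run, minus the probability in the corrupted run alone); the function names and the example probabilities are hypothetical and are not taken from the paper.

```python
def indirect_effect(p_corrupted, p_restored):
    """Indirect effect of one restored hidden state for one prompt.

    p_corrupted: probability of the correct answer when the input is corrupted.
    p_restored:  same probability when one clean hidden state is patched
                 back into the corrupted run.
    """
    return p_restored - p_corrupted


def average_indirect_effect(runs):
    """Mean indirect effect over prompts.

    runs: list of (p_corrupted, p_restored) pairs, one pair per prompt,
          all measured at the same layer and token position.
    """
    if not runs:
        raise ValueError("need at least one prompt")
    return sum(indirect_effect(c, r) for c, r in runs) / len(runs)


# Hypothetical probabilities for two prompts (illustration only).
example_runs = [(0.1, 0.5), (0.2, 0.4)]
aie = average_indirect_effect(example_runs)
print(f"AIE over {len(example_runs)} prompts: {aie:.3f}")
```

Under this definition, a small AIE at an MLP site means that restoring that site recovers little of the correct answer's probability, which is how a low reliance on parametric memory would show up.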

Paper Structure (20 sections, 3 equations, 8 figures)

Figures (8)

  • Figure 1: Setup of a factual QA system with RAG, used in this paper to understand the usefulness of parametric knowledge stored in LLaMa and Phi.
  • Figure 2: Language models minimally rely on the MLP parametric memory in the presence of retrieved context. From left to right: Average Indirect Effect (AIE) from MLPs after corrupting subject + context for the RAG scenario, and after corrupting the subject for the vanilla case. Here, FST = First Subject Token, MST = Middle Subject Tokens, LST = Last Subject Token, FSST = First Subsequent Token, FT = Further Tokens, LT = Last Token. On average, a 5-fold decrease in AIE is observed at the LST with RAG vs. vanilla, signalling decreased use of the MLPs when RAG context is present.
  • Figure 3: The last-token residual stream obtains less enriched information from the subject token in the query in the presence of retrieved context. (a) Subject token contribution for RAG vs. vanilla in Llama-2, (b) comparison of subject and attribute contributions with RAG for Llama-2, (c) subject contribution for RAG vs. vanilla in Phi-2, (d) comparison of subject and attribute contributions with RAG for Phi-2. Panels (a) and (c) indicate that the subject contribution is about half as large with RAG as in the vanilla case. Panels (b) and (d) show that the attribute token's attention contribution is 5 times the subject contribution.
  • Figure 4: In the presence of retrieved context, knocking out attention weights from the subject in the query to the last token has minimal effect. (Left) Llama-2, (Right) Phi-2. [Knocking out attribute tokens decreases the probability by up to 25% in Phi-2 and 20% in Llama-2, whereas knocking out subject-token attention reduces it by only 5%.]
  • Figure 5: Attention knockouts in LLaMa - vanilla setting
  • ...and 3 more figures
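
The attention knockout referred to in the Figure 4 and Figure 5 captions can be illustrated with a toy example. The sketch below assumes a knockout is implemented by setting the pre-softmax attention score of a chosen source position to negative infinity, so the last token can no longer read from it; the function names, token labels, scores, and value vectors are all hypothetical, and this is not the authors' code.

```python
import math


def attention_weights(scores, blocked=()):
    """Softmax over the last token's attention scores.

    scores:  one pre-softmax score per source position.
    blocked: source positions to knock out (their weight becomes 0).
    """
    masked = [-math.inf if i in blocked else s for i, s in enumerate(scores)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("cannot block every source position")
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, blocked=()):
    """Weighted sum of value vectors under (optionally knocked-out) attention."""
    weights = attention_weights(scores, blocked)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


# Hypothetical toy prompt: a context attribute token, a subject token,
# a relation token, and the last token attending to all of them.
tokens = ["ctx_attribute", "subject", "relation", "last"]
scores = [2.0, 0.5, 0.1, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

baseline = attend(scores, values)
no_subject = attend(scores, values, blocked={tokens.index("subject")})
no_attribute = attend(scores, values, blocked={tokens.index("ctx_attribute")})

print("baseline     :", baseline)
print("subject out  :", no_subject)
print("attribute out:", no_attribute)
```

In a real model the comparison would be made on the probability of the correct answer with and without the knocked-out edges, rather than on a single head's output as in this toy.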