A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia
Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre Kıcıman, Hamid Palangi, Barun Patra, Robert West
TL;DR
This work investigates how large language models ground their outputs in contextual information when that information conflicts with their internal parametric knowledge. It introduces Fakepedia, a counterfactual dataset derived from ParaRel, with variants that stress grounding through single-hop and multi-hop reasoning, enabling systematic evaluation of grounding versus factual recall. The authors propose Masked Grouped Causal Tracing (MGCT), a scalable causal mediation method that intervenes on grouped transformer states to identify the computation patterns that distinguish grounded from ungrounded responses. MGCT reveals that grounding is distributed across the network, whereas ungrounded responses often hinge on MLPs near the last subject token. The authors further show that an XGBoost classifier trained on MGCT features can detect from computation traces alone whether a response is grounded, with 92.8% accuracy. The Fakepedia dataset and MGCT tooling aim to advance mechanistic understanding of grounding and its interaction with factual recall in in-context learning and retrieval-augmented generation.
Abstract
Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context. Yet the mechanisms underlying this contextual grounding remain unknown, especially in situations where contextual information contradicts factual knowledge stored in the parameters, which LLMs also excel at recalling. Favoring the contextual information is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify outdated or noisy stored knowledge. We present a novel method to study grounding abilities using Fakepedia, a novel dataset of counterfactual texts constructed to clash with a model's internal parametric knowledge. We benchmark various LLMs with Fakepedia and conduct a causal mediation analysis of LLM components when answering Fakepedia queries, based on our Masked Grouped Causal Tracing (MGCT) method. Through this analysis, we identify distinct computational patterns between grounded and ungrounded responses. We finally demonstrate that distinguishing grounded from ungrounded responses is achievable through computational analysis alone. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.
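To make the grounded versus ungrounded distinction concrete, the following minimal sketch labels a model response to a counterfactual example. It is an illustration only, not the authors' evaluation code: the record fields (`context_answer`, `parametric_answer`), the substring-matching rule, and the example text are all assumptions made here for clarity.

```python
from dataclasses import dataclass


@dataclass
class CounterfactualExample:
    """Hypothetical record layout; the real Fakepedia schema may differ."""
    context: str            # counterfactual passage shown to the model
    query: str              # question about the passage's subject
    context_answer: str     # answer asserted by the (false) context
    parametric_answer: str  # answer the model would recall without context


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so matching is not case-sensitive.
    return " ".join(text.lower().split())


def label_response(example: CounterfactualExample, response: str) -> str:
    """Return 'grounded', 'ungrounded', or 'other' (assumed matching rule)."""
    resp = _normalize(response)
    has_context = _normalize(example.context_answer) in resp
    has_parametric = _normalize(example.parametric_answer) in resp
    if has_context and not has_parametric:
        return "grounded"      # the model followed the context
    if has_parametric and not has_context:
        return "ungrounded"    # the model fell back on stored knowledge
    return "other"             # neither answer, or both mentioned


def grounding_rate(examples, responses) -> float:
    """Fraction of responses labeled 'grounded' (0.0 for empty input)."""
    labels = [label_response(e, r) for e, r in zip(examples, responses)]
    return labels.count("grounded") / len(labels) if labels else 0.0


# Invented example in the spirit of a counterfactual passage.
example = CounterfactualExample(
    context="The Eiffel Tower is a landmark in the centre of Rome.",
    query="In which city is the Eiffel Tower located?",
    context_answer="Rome",
    parametric_answer="Paris",
)
```

With this rule, a response of "Rome" is labeled grounded and "It is in Paris." is labeled ungrounded. Substring matching is a deliberately crude choice; it would mislabel responses that mention an answer only to reject it.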
