A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia

Giovanni Monea; Maxime Peyrard; Martin Josifoski; Vishrav Chaudhary; Jason Eisner; Emre Kıcıman; Hamid Palangi; Barun Patra; Robert West

TL;DR

This work investigates how large language models ground their outputs in contextual information when that information conflicts with their internal parametric knowledge. It introduces Fakepedia, a counterfactual dataset derived from ParaRel, with variants that stress grounding through single-hop and multi-hop reasoning, enabling systematic evaluation of grounding versus factual recall. The authors propose Masked Grouped Causal Tracing (MGCT), a scalable causal mediation method that intervenes on grouped transformer states to identify computation patterns distinguishing grounded from ungrounded responses. MGCT reveals that grounding is distributed across the network, whereas ungrounded responses often hinge on MLPs near the last subject token. The authors further show that grounding status can be detected automatically from computation traces with high accuracy (92.8%), using an XGBoost classifier on MGCT features. The Fakepedia dataset and MGCT tooling aim to advance mechanistic understanding of grounding and its interaction with factual recall in in-context learning and retrieval-augmented generation.
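To make the grounded-versus-factual distinction concrete, here is a minimal sketch of how a single counterfactual example might be represented and how a response might be labeled. The field names, the `label_response` helper, and the Eiffel Tower example are illustrative assumptions, not the actual Fakepedia schema or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class CounterfactualItem:
    """One hypothetical Fakepedia-style example (field names are assumptions)."""
    context: str          # counterfactual passage placed in the prompt
    query: str            # cloze-style question about the subject
    grounded_answer: str  # answer supported by the counterfactual context
    factual_answer: str   # answer the model would recall from its parameters


def label_response(item: CounterfactualItem, response: str) -> str:
    """Label a model response as 'grounded', 'factual', or 'other'."""
    text = response.strip().lower()
    if text.startswith(item.grounded_answer.lower()):
        return "grounded"
    if text.startswith(item.factual_answer.lower()):
        return "factual"
    return "other"


# Invented example: the context contradicts what the model has memorized.
item = CounterfactualItem(
    context="The Eiffel Tower is a landmark in the centre of Rome.",
    query="The Eiffel Tower is located in the city of",
    grounded_answer="Rome",
    factual_answer="Paris",
)

labels = [label_response(item, r) for r in ["Rome", " Paris.", "London"]]
print(labels)
```

A response that follows the context ("Rome") counts as grounded, while one that falls back on memorized knowledge ("Paris") counts as factual, which is the ungrounded case studied in this work.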

Abstract

Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context. Yet the mechanisms underlying this contextual grounding remain unknown, especially in situations where contextual information contradicts factual knowledge stored in the parameters, which LLMs also excel at recalling. Favoring the contextual information is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify outdated or noisy stored knowledge. We present a novel method to study grounding abilities using Fakepedia, a novel dataset of counterfactual texts constructed to clash with a model's internal parametric knowledge. We benchmark various LLMs with Fakepedia and conduct a causal mediation analysis of LLM components when answering Fakepedia queries, based on our Masked Grouped Causal Tracing (MGCT) method. Through this analysis, we identify distinct computational patterns between grounded and ungrounded responses. Finally, we demonstrate that distinguishing grounded from ungrounded responses is achievable through computational analysis alone. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.

Paper Structure (27 sections, 2 equations, 5 figures, 3 tables)

Introduction
Related Work
Mechanistic Interpretability
Counterfactual Datasets
Background
Terminology
Grounded vs. Factual
Counterfactual Data Creation
Counterfactual ParaRel
Fakepedia
Descriptive Behavioral Analysis
Causal Analysis of the Computation Graph
MGCT Analysis
MGCT Experiments
Automatic Detection of Ungrounded Responses
...and 12 more sections

Figures (5)

Figure 1: Studying Grounding in LLMs. This work makes three distinct contributions: (a) introducing a counterfactual dataset designed to measure the abilities of LLMs to ground their answer in the new information provided in the prompt, (b) conducting a descriptive analysis of grounding performances across several LLMs, and (c) implementing an improved causal mediation analysis that we use to show that computation patterns inside LLMs can predict whether the answer is grounded.
Figure 2: Masked Grouped Causal Tracing (MGCT). This figure illustrates the mediation analysis from MGCT, which refines the earlier causal tracing method of Meng et al. (2023). The process involves three steps: (i) Clean run: all states within the computation graph are recorded during a forward pass, resulting in a predicted token, in this case "Paris". (ii) Corrupted run: the subject tokens are substituted with special non-textual tokens such as <UNK> or <EOS>, leading to a different probability for the predicted token. (iii) Restored run: the corrupted run is repeated with a group of states (in this instance, four hidden activations) restored to their values from the clean run, resulting in a partially restored probability for the predicted token. The indirect effect is estimated by the extent to which restoring these states recovers the probability of the predicted token.
Figure 3: Masked Grouped Causal Tracing analysis on Fakepedia-base. This figure illustrates the application of MGCT analysis to GPT2-XL and Llama2-7B on the Fakepedia-base dataset. We distinguish between instances where the models generated grounded answers and those where they generated ungrounded answers. In the MGCT analysis, we restore full columns together, all states across all layers for a given column at a time, resulting in one effect per token. On the y-axis, we report the percentage of explained change in probability between the clean and corrupted runs due to the restoration of the column. To average across different sequences, we bucketed tokens into subject (subj-) categories and following tokens in the prompt (cont-). Red labels on the x-axis indicate that the difference in MGCT effect between grounded and ungrounded responses is statistically significant based on a t-test with a p-value threshold of 0.01.
Figure 4: Masked Grouped Causal Tracing analysis on Fakepedia-MH. This figure illustrates the application of MGCT analysis to GPT2-XL and Llama2-7B on the Fakepedia-MH dataset. We distinguish between instances where the models generated grounded answers and those where they generated ungrounded answers. In the MGCT analysis, we restore full columns together, all states across all layers for a given column at a time, resulting in one effect per token. On the y-axis, we report the percentage of explained change in probability between the clean and corrupted runs due to the restoration of the column. To average across different sequences, we bucketed tokens into subject (subj-) categories and following tokens in the prompt (cont-). Red labels on the x-axis indicate that the difference in MGCT effect between grounded and ungrounded responses is statistically significant based on a t-test with a p-value threshold of 0.01.
Figure 5: Masked Grouped Causal Tracing analysis on LLaMA. This figure illustrates the application of MGCT analysis to LLaMA-7B on the Fakepedia dataset. We distinguish between instances where the model generated grounded answers and those where it generated ungrounded answers. In the MGCT analysis, we restore full columns together, all states across all layers for a given column at a time, resulting in one effect per token. On the y-axis, we report the percentage of explained change in probability between the clean and corrupted runs due to the restoration of the column. To average across different sequences, we bucketed tokens into subject (subj-) categories and following tokens in the prompt (cont-). Red labels on the x-axis indicate that the difference in MGCT effect between grounded and ungrounded responses is statistically significant based on a t-test with a p-value threshold of 0.01.
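
The three-run procedure described in Figure 2 and the per-column effects reported in Figures 3 to 5 can be sketched on a toy model. The code below is a minimal illustration under strong assumptions: the "model" is a small causal network with scalar hidden states rather than a transformer, the corrupted subject embedding is simply set to 0.0 to stand in for <UNK>, and the names `forward` and `column_effect` are hypothetical. It is not the authors' MGCT implementation.

```python
import math


def forward(embeddings, n_layers=3, patch=None):
    """Toy causal network. Returns (probability, states).

    states[l][t] is the scalar state at layer l (0 = embeddings), position t.
    patch maps (layer, position) -> value that overwrites the computed state.
    """
    patch = patch or {}
    states = [[patch.get((0, t), e) for t, e in enumerate(embeddings)]]
    for layer in range(1, n_layers + 1):
        prev = states[-1]
        row = []
        for t in range(len(prev)):
            # Each position only sees itself and earlier positions.
            causal_mean = sum(prev[: t + 1]) / (t + 1)
            value = math.tanh(0.5 * prev[t] + causal_mean)
            row.append(patch.get((layer, t), value))
        states.append(row)
    logit = 2.0 * states[-1][-1]  # read out from the last position
    return 1.0 / (1.0 + math.exp(-logit)), states


def column_effect(embeddings, subject_positions, position, n_layers=3):
    """Fraction of the clean-vs-corrupted probability gap recovered by
    restoring every layer's state at one token position (a full column)."""
    # (i) Clean run: record all states.
    p_clean, clean_states = forward(embeddings, n_layers)
    # (ii) Corrupted run: replace subject embeddings (0.0 stands in for <UNK>).
    corrupted = [0.0 if t in subject_positions else e
                 for t, e in enumerate(embeddings)]
    p_corrupt, _ = forward(corrupted, n_layers)
    # (iii) Restored run: corrupted input, one column reset to clean values.
    patch = {(l, position): clean_states[l][position]
             for l in range(n_layers + 1)}
    p_restored, _ = forward(corrupted, n_layers, patch)
    return (p_restored - p_corrupt) / (p_clean - p_corrupt)


# Positions: 0 = context token, 1 = subject, 2 = continuation, 3 = last token.
tokens = [0.3, 1.2, 0.4, 0.2]
effects = [column_effect(tokens, {1}, t) for t in range(len(tokens))]
for t, effect in enumerate(effects):
    print(f"position {t}: {100 * effect:.1f}% of the change explained")
```

In this toy setting, restoring a column before the subject recovers nothing because causal masking leaves it unaffected by the corruption, restoring the subject column or the last column recovers the full gap, and an intermediate column recovers only part of it. Multiplying the effect by 100 gives a percentage of explained change of the kind plotted on the y-axis of the figures.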
