Racing Thoughts: Explaining Contextualization Errors in Large Language Models

Michael A. Lepori, Michael C. Mozer, Asma Ghandeharioun

TL;DR

This work tackles why large language models mis-contextualize polysemous terms by proposing the LLM Race Conditions Hypothesis, which attributes contextualization failures to dependencies that require correct ordering of information across transformer layers. It combines a controlled QA task with distractors, Patchscopes interventions, and multi-model analyses to identify a critical window around mid-layers where subject-entity contextualization must occur for correct final outputs. Through attention-mass analyses, logit lens, causal ablations, and patching interventions (cross-patching, backpatching, and frozen variants), the authors provide both correlational and causal evidence across Gemma-2 and Llama-2 that supports the hypothesis and demonstrates practical mitigation strategies. The results highlight a fundamental limitation of purely feedforward transformers for robust contextualization and point to potential remedies, including architectural changes (recurrent connections) and inference-time interventions to improve reliability in context-dependent reasoning.

Abstract

The profound success of transformer-based language models can largely be attributed to their ability to integrate relevant contextual information from an input sequence in order to generate a response or complete a task. However, we know very little about the algorithms that a model employs to implement this capability, nor do we understand their failure modes. For example, given the prompt "John is going fishing, so he walks over to the bank. Can he make an ATM transaction?", a model may incorrectly respond "Yes" if it has not properly contextualized "bank" as a geographical feature, rather than a financial institution. We propose the LLM Race Conditions Hypothesis as an explanation of contextualization errors of this form. This hypothesis identifies dependencies between tokens (e.g., "bank" must be properly contextualized before the final token, "?", integrates information from "bank"), and claims that contextualization errors are a result of violating these dependencies. Using a variety of techniques from mechanistic interpretability, we provide correlational and causal evidence in support of the hypothesis, and suggest inference-time interventions to address it.
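The ordering dependency described in the abstract can be illustrated with a toy simulation. This is only a sketch of the idea, not the authors' method: the function name `answer_sense`, the two layer indices, and the sense labels are all hypothetical, and a real transformer mixes information gradually rather than in a single step.

```python
def answer_sense(cue_layer, read_layer,
                 default_sense="financial", cued_sense="river"):
    """Toy feedforward pass (hypothetical, for illustration only).

    cue_layer:  layer at which "bank" absorbs the cue ("fishing")
                and its word sense is resolved.
    read_layer: layer at which the final token ("?") reads from "bank".
    Returns the sense of "bank" that the final token ends up seeing.
    """
    bank_state = default_sense  # assumed out-of-context default sense
    seen_by_question = None
    for layer in range(max(cue_layer, read_layer) + 1):
        # A read at this layer only sees states written by earlier layers.
        if layer == read_layer:
            seen_by_question = bank_state
        if layer == cue_layer:
            bank_state = cued_sense
    return seen_by_question


# Dependency respected: "bank" is resolved before the question reads it.
print(answer_sense(cue_layer=3, read_layer=10))
# Dependency violated: the question reads "bank" before it is resolved.
print(answer_sense(cue_layer=10, read_layer=3))
```

In the second call the final token integrates the default sense, and a purely feedforward pass gives it no later opportunity to re-read the corrected representation, which is the failure the hypothesis describes.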

Paper Structure (26 sections, 2 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: (Left) A contextualization error found in Gemini, a frontier LLM (Gemini Team, 2023). In this dialogue, the user's first message implies that the correct word sense of bank is as a geographical feature (i.e., a river bank). Though the model recognizes this in its first reply, it fails to maintain this word sense of bank when probed in the very next user message, instead defaulting to the interpretation of bank as a financial institution. (Right) Illustrating the LLM Race Conditions Hypothesis. The Race Conditions Hypothesis suggests that contextualization errors result from out-of-order contextualization over the layers of an LLM. In this case, the question tokens are contextualizing with the polysemous word bank in an early layer, before its word sense is resolved via contextualizing with cue tokens.
  • Figure 2: Behavioral results across all three datasets for Gemma-2. We find that injecting distractors routinely engenders contextualization errors.
  • Figure 3: Attention mass over layers for all datasets for Gemma-2. We observe an inverse U-shape over layers, suggesting that the question tokens might only incorporate information present in the subject entity in the middle layers of processing.
  • Figure 4: Logit lens results for Gemma-2 over all three datasets. For each layer, we plot the difference in mean logit differences between the 'yes' token and 'no' token between questions that the model answered correctly vs. incorrectly. Each dataset is disaggregated to separate questions where the correct answer is 'yes' from those where the correct answer is 'no.' This metric demonstrates the impact of successful contextualization on the final question token over layers. For all datasets and partitions, we find that the model's ultimate answer becomes identifiable around layer 20.
  • Figure 5: Attention Ablation Results for Gemma-2. We find that ablating cue tokens and distractor tokens both have the intended impact on performance -- ablating cue tokens drops performance and ablating distractors increases performance. Most notably, however, we find that these interventions only impact model performance in the first half of layers.
  • ...and 7 more figures
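
The per-layer metric described in the Figure 4 caption (the difference in mean 'yes' minus 'no' logit differences between correctly and incorrectly answered questions) might be computed roughly as follows. This is a minimal sketch under stated assumptions: the function name `logit_lens_gap` and the input layout are hypothetical, the logit-lens decoding step that produces the per-layer logits is assumed to have been done elsewhere, and the toy numbers are invented for illustration and are not results from the paper.

```python
from statistics import mean


def logit_lens_gap(yes_minus_no, answered_correctly):
    """Per-layer gap between correct and incorrect questions.

    yes_minus_no[q][l]: logit('yes') - logit('no') at the final question
        token, decoded from layer l for question q (assumed precomputed).
    answered_correctly[q]: True if the model's final answer was correct.
    """
    correct = [row for row, ok in zip(yes_minus_no, answered_correctly) if ok]
    incorrect = [row for row, ok in zip(yes_minus_no, answered_correctly) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect question")
    n_layers = len(yes_minus_no[0])
    return [
        mean(row[layer] for row in correct) - mean(row[layer] for row in incorrect)
        for layer in range(n_layers)
    ]


# Invented toy values: 4 questions, 3 layers, correct answer is 'yes'.
diffs = [
    [0.0, 0.5, 2.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.25, -1.0],
    [0.0, 0.25, -2.0],
]
gap = logit_lens_gap(diffs, [True, True, False, False])
print(gap)  # [0.0, 0.25, 3.0]
```

Following the caption, the function would be applied separately to the partition where the correct answer is 'yes' and the partition where it is 'no'; in the 'no' partition the sign of the gap is expected to flip. The layer at which the gap departs from zero marks where the eventual answer becomes identifiable.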