Overthinking the Truth: Understanding how Language Models Process False Demonstrations

Danny Halawi; Jean-Stanislas Denain; Jacob Steinhardt

TL;DR

The paper investigates why language models imitate incorrect demonstrations in few-shot contexts, and introduces two concepts: overthinking and false induction heads. Applying a logit-lens analysis to intermediate layers, and contrasting correct demonstrations with permuted-label ones, it locates a critical layer after which incorrect information begins to dominate, and shows that attention heads in late layers drive this effect. The authors test causality by zeroing out late layers and by ablating false induction heads; both interventions reduce the accuracy gap between correct and incorrect prompts across multiple datasets, with minimal impact on correct-prompt performance. The findings suggest that analyzing intermediate computations is a useful way to understand and mitigate harmful in-context learning behaviors, with implications for prompt design and guardrail strategies.
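The "accuracy gap" above can be made concrete with a little arithmetic. The sketch below assumes the gap is the difference between accuracy with correct and with incorrect demonstrations, and that a reduction is reported as the fraction of that gap removed by an intervention. The function names and the accuracies are hypothetical illustrations, not values or definitions taken from the paper.

```python
def accuracy_gap(acc_correct: float, acc_incorrect: float) -> float:
    """Accuracy with correct demonstrations minus accuracy with incorrect ones."""
    return acc_correct - acc_incorrect


def gap_reduction(gap_before: float, gap_after: float) -> float:
    """Fraction of the original gap removed by an intervention (assumed definition)."""
    if gap_before <= 0:
        raise ValueError("gap_before must be positive")
    return 1.0 - gap_after / gap_before


# Made-up accuracies, for illustration only.
before = accuracy_gap(acc_correct=0.80, acc_incorrect=0.40)  # full model
after = accuracy_gap(acc_correct=0.80, acc_incorrect=0.50)   # after an ablation
reduction = gap_reduction(before, after)
print(f"gap before: {before:.2f}, gap after: {after:.2f}, reduction: {reduction:.0%}")
```

Under this reading, an intervention is only interesting if `acc_correct` stays roughly constant while `acc_incorrect` rises, which is the pattern the summary describes.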

Abstract

Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: "overthinking" and "false induction heads". The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both kinds of demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, is a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
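Decoding predictions from intermediate layers can be illustrated with a toy residual stream. The sketch below is an assumption-laden simplification rather than the authors' implementation: the two "layers" are hand-written functions, the hidden state has two dimensions, and the layer normalization that a real logit lens applies before unembedding is omitted. All names (`run_with_early_exit`, `decode`, `early_layer`, `late_layer`) are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_with_early_exit(x: Vector, layers: Sequence[Layer], exit_layer: int) -> Vector:
    """Apply only the first `exit_layer` layers to the residual stream.

    Skipping the remaining layers is equivalent to zeroing their updates.
    """
    h = list(x)
    for layer in layers[:exit_layer]:
        update = layer(h)
        h = [a + b for a, b in zip(h, update)]  # residual connection
    return h


def decode(h: Vector, unembed: Sequence[Vector]) -> int:
    """Return the label index whose unembedding row scores highest against h."""
    logits = [sum(w * a for w, a in zip(row, h)) for row in unembed]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy setup: label 0 is the true label, label 1 is the false label in context.
def early_layer(h: Vector) -> Vector:
    return [1.0, 0.0]  # writes evidence for the true label


def late_layer(h: Vector) -> Vector:
    return [0.0, 3.0]  # stands in for a late head copying the false label


layers = [early_layer, late_layer]
unembed = [[1.0, 0.0], [0.0, 1.0]]
x = [0.0, 0.0]

per_layer = [decode(run_with_early_exit(x, layers, k), unembed)
             for k in range(1, len(layers) + 1)]
print(per_layer)  # prediction after layer 1, then after layer 2
```

In this toy, exiting after the first layer yields the true label and running both layers yields the false one, which mirrors the qualitative pattern the abstract calls overthinking.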

Paper Structure (21 sections, 3 equations, 26 figures, 7 tables)

Introduction
Related Work
Preliminaries: Few-shot Learning with False Demonstrations
False demonstration labels decrease accuracy
Random and partially correct labels lead to lower accuracy than correct labels
Zeroing Later Layers Improves Accuracy
Zooming into attention heads
Discussion
Appendix
Logit lens results for other models
Calibration
Logit lens results for other models across tasks
Logit lens results for GPT-J without calibration
Logit lens results for each SST-2 prompt format
Logit lens results for other metrics
...and 6 more sections

Figures (26)

Figure 1: Left: Given a prompt of incorrect demonstrations, language models are more likely to output incorrect labels. Center: When demonstrations are incorrect, zeroing out the later layers increases the classification accuracy, here on Financial-Phrasebank. Right: We identify 5 attention heads and remove them from the model: this reduces the effect of incorrect demonstrations by 32.6% on Financial-Phrasebank, without decreasing the accuracy given correct demonstrations.
Figure 2: GPT-J behavior in the permuted labels setting. Left: The difference in accuracy between correct and incorrect prompts increases with the number of demonstrations. Right: As the number of false demonstrations increases, the model chooses the permuted label $\sigma(\text{class}(x))$ more often than the other labels, rather than making random errors.
Figure 3: GPT-J early-exit classification accuracies across 6 task categories, given accurate and inaccurate demonstrations (here in the permuted labels setting). Plots are grouped by task type: sentiment analysis (a-b), hate speech detection (c), paraphrase detection (d), natural language inference (e), topic classification (f-g), and a toy task (h). Given incorrect demonstrations, zeroing out all transformer blocks after layer 16 outperforms running the entire model.
Figure 4: Average calibrated accuracy across 14 tasks for GPT2-XL (a), GPT-J (b), and GPT-NeoX (c). Early-exiting outperforms running the entire model when the demonstrations contain permuted, random, or half correct labels.
Figure 5: Examples of attention patterns on incorrect demonstrations from the toy Unnatural dataset, for heads that are label-attending but not class-sensitive (Left), heads that are class-sensitive but not label-attending (Center), and heads that are both label-attending and class-sensitive (Right).
...and 21 more figures
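The permuted labels setting referenced in the captions above can be sketched as a prompt builder that replaces each demonstration's label with $\sigma(\text{class}(x))$. The sketch assumes a cyclic shift for $\sigma$, so that every demonstration label is wrong; the paper's exact choice of permutation, its prompt template, and the example sentences here are not taken from the source and are purely illustrative.

```python
from typing import Dict, List, Optional, Tuple


def cyclic_permutation(labels: List[str]) -> Dict[str, str]:
    """Map each label to the next one, so no label maps to itself (assumed sigma)."""
    if len(labels) < 2:
        raise ValueError("need at least two labels")
    return {label: labels[(i + 1) % len(labels)] for i, label in enumerate(labels)}


def build_prompt(demos: List[Tuple[str, str]], query: str,
                 sigma: Optional[Dict[str, str]] = None) -> str:
    """Format demonstrations and a query; apply sigma to demonstration labels if given."""
    parts = []
    for text, label in demos:
        shown = sigma[label] if sigma is not None else label
        parts.append(f"Input: {text}\nLabel: {shown}\n")
    parts.append(f"Input: {query}\nLabel:")  # the model completes this label
    return "\n".join(parts)


# Hypothetical three-class sentiment examples.
LABELS = ["negative", "neutral", "positive"]
DEMOS = [("Profits fell sharply.", "negative"), ("Revenue rose 10%.", "positive")]

sigma = cyclic_permutation(LABELS)
correct_prompt = build_prompt(DEMOS, "Sales were flat.")
incorrect_prompt = build_prompt(DEMOS, "Sales were flat.", sigma)
print(incorrect_prompt)
```

Comparing model accuracy on `correct_prompt` versus `incorrect_prompt` style inputs, and checking how often the prediction equals `sigma[true_label]`, corresponds to the two panels described for Figure 2.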
