Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load

Logan Mann, Nayan Saxena, Sarah Tandon, Chenhao Sun, Savar Toteja, Kevin Zhu

TL;DR

The paper investigates ironic rebound in transformer models that are instructed not to mention a concept, focusing on how cognitive load from distractors and framing polarity affect suppression. It introduces ReboundBench and runs two experiments that quantify rebound using log-probability, surprisal, suppression, and polarity metrics, complemented by circuit tracing of attention heads. Key findings: rebound occurs immediately after negation, grows with semantic distractors, and persists longer in models with stronger polarity separation. It is driven by a sparse set of middle-layer heads that amplify forbidden tokens despite early-layer suppression. These results reveal a fragile yet mechanistically interpretable negation dynamic in LLMs, with implications for safety filters and alignment, and they point toward targeted mitigations guided by internal circuit insights.

Abstract

Negation instructions such as 'do not mention $X$' can paradoxically increase the accessibility of $X$ in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigate this tension with two experiments. (1) Load & content: after a negation instruction, we vary distractor text (semantic, syntactic, repetition) and measure rebound strength. (2) Polarity separation: we test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together, these findings, complemented by a circuit tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress them, link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.

Paper Structure

This paper contains 27 sections, 4 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Change in surprisal ($\Delta$ bits) of the forbidden token relative to a high-load baseline as normalized distractor length increases, for three distractor types: semantic, syntactic, and repetition.
  • Figure 2: Trade-off between rebound magnitude (AUC) and persistence ($L_{50}$, the load at which rebound halves)
  • Figure 3: How suppression and amplification effects change across model layers. Early layers suppress, middle layers show mixed effects with emerging rebound, and later layers stabilize.
  • Figure 4: Circuit map of the most influential attention heads. Circle size indicates effect magnitude; color indicates direction (red = amplification, green = suppression). Effects cluster in the middle layers.
  • Figure B2.1: GPT-2-Small
  • ...and 9 more figures
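
The quantities named in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the sign convention for the surprisal change, the trapezoidal rule for AUC, and the linear interpolation used for $L_{50}$ are all assumptions made for illustration. It assumes log-probabilities are given in nats and that the rebound curve is sampled at increasing load values, starting from its value at the lowest load.

```python
import math


def surprisal_bits(logprob_nats):
    """Surprisal in bits of a token given its natural-log probability."""
    return -logprob_nats / math.log(2)


def delta_surprisal(logprob_nats, baseline_logprob_nats):
    """Change in surprisal (bits) relative to a baseline condition.

    Assumed sign convention: negative values mean the token is
    more expected than under the baseline.
    """
    return surprisal_bits(logprob_nats) - surprisal_bits(baseline_logprob_nats)


def auc(loads, values):
    """Area under a sampled curve, using the trapezoidal rule (an assumption)."""
    return sum(
        (loads[i + 1] - loads[i]) * (values[i] + values[i + 1]) / 2
        for i in range(len(loads) - 1)
    )


def l50(loads, rebound):
    """Load at which rebound first falls to half its initial value.

    Uses linear interpolation between samples (an assumption).
    Returns None if the curve never reaches the half-way level.
    """
    target = rebound[0] / 2
    for i in range(len(loads) - 1):
        r0, r1 = rebound[i], rebound[i + 1]
        if r0 > target >= r1:
            frac = (r0 - target) / (r0 - r1)
            return loads[i] + frac * (loads[i + 1] - loads[i])
    return None
```

Under these assumptions, a curve with a large AUC but a small `l50` value would correspond to strong but short-lived rebound, which is the trade-off Figure 2 is described as showing.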