Don't Think of the White Bear: Ironic Negation in Transformer Models Under Cognitive Load
Logan Mann, Nayan Saxena, Sarah Tandon, Chenhao Sun, Savar Toteja, Kevin Zhu
TL;DR
The paper investigates ironic rebound in transformer models that are instructed not to mention a concept, focusing on how cognitive load from distractor text and framing polarity affect suppression. It introduces ReboundBench and runs two experiments that quantify rebound using log-probability, surprisal, suppression, and polarity metrics, complemented by circuit tracing of attention heads. The key findings are that rebound occurs immediately after the negation, grows with semantic distractors, and persists longer in models with stronger polarity separation. It is driven by a sparse set of middle-layer heads that amplify forbidden tokens despite early-layer suppression. These results reveal a fragile yet mechanistically interpretable negation dynamic in LLMs, with implications for safety filters and alignment, and they point toward targeted mitigations guided by internal circuit insights.
Abstract
Negation instructions such as 'do not mention $X$' can paradoxically increase the accessibility of $X$ in human thought, a phenomenon known as ironic rebound. Large language models (LLMs) face the same challenge: suppressing a concept requires internally activating it, which may prime rebound instead of avoidance. We investigate this tension with two experiments. \textbf{(1) Load \& content}: After a negation instruction, we vary the distractor text (semantic, syntactic, repetition) and measure rebound strength. \textbf{(2) Polarity separation}: We test whether models distinguish neutral from negative framings of the same concept and whether this separation predicts rebound persistence. Results show that rebound consistently arises immediately after negation and intensifies with longer or semantic distractors, while repetition supports suppression. Stronger polarity separation correlates with more persistent rebound. Together, these findings, complemented by a circuit tracing analysis that identifies sparse middle-layer attention heads amplifying forbidden tokens while early layers suppress them, link cognitive predictions of ironic rebound with mechanistic insights into long-context interference. To support future work, we release ReboundBench, a dataset of 5,000 systematically varied negation prompts designed to probe rebound in LLMs.
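The summary above names log-probability and surprisal as rebound metrics without giving their formulas. As a rough illustration only, the sketch below uses the standard definition of surprisal (the negative log-probability of a token) together with a hypothetical rebound score defined as the difference in log-probability of the forbidden token between a negated prompt and a neutral baseline. The function names `surprisal_bits` and `rebound_score`, and the baseline-difference definition itself, are assumptions for illustration and are not taken from the paper.

```python
import math


def surprisal_bits(log_prob: float) -> float:
    """Convert a natural-log probability into surprisal measured in bits."""
    return -log_prob / math.log(2)


def rebound_score(logp_negated: float, logp_baseline: float) -> float:
    """Hypothetical rebound score (an assumption, not the paper's definition).

    A positive value means the forbidden token is more likely after the
    negation instruction than under a neutral baseline prompt.
    """
    return logp_negated - logp_baseline


if __name__ == "__main__":
    # Made-up probabilities for the forbidden token, for illustration only.
    logp_neg = math.log(0.2)   # after "do not mention X"
    logp_base = math.log(0.1)  # neutral prompt with no instruction
    print(surprisal_bits(logp_neg))
    print(rebound_score(logp_neg, logp_base))
```

Under this assumed definition, a score above zero would indicate rebound and a score below zero would indicate successful suppression. The log-probabilities themselves would come from the model's next-token distribution at the position of interest.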
