Tug-of-war between idioms' figurative and literal interpretations in LLMs

Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg

TL;DR

This work investigates how autoregressive transformers understand idioms that have both figurative and literal meanings. Using causal tracing, knockout, and activation patching on models such as Llama3, the authors identify a three‑stage mechanism: early retrieval of figurative interpretations in initial layers, rapid use of preceding context to bias interpretation, and a dual routing of interpretations via an intermediate figurative path and a bypass literal path. They pinpoint idiom‑specific attention heads and early MLPs that causally promote figurative readings, and show that context disambiguation unfolds through MHSA across mid‑layers, ultimately aligning with the context and maintaining both readings for robust final predictions. The findings provide a detailed mechanistic picture of idiom processing in transformers, with implications for improving figurative language understanding and interpretability in large language models.

Abstract

Idioms present a unique challenge for language models due to their non-compositional figurative interpretations, which often strongly diverge from the idiom's literal interpretation. In this paper, we employ causal tracing to systematically analyze how pretrained causal transformers deal with this ambiguity. We localize three mechanisms: (i) Early sublayers and specific attention heads retrieve an idiom's figurative interpretation, while suppressing its literal interpretation. (ii) When disambiguating context precedes the idiom, the model leverages it from the earliest layer and later layers refine the interpretation if the context conflicts with the retrieved interpretation. (iii) Then, selective, competing pathways carry both interpretations: an intermediate pathway prioritizes the figurative interpretation and a parallel direct route favors the literal interpretation, ensuring that both readings remain available. Our findings provide mechanistic evidence for idiom comprehension in autoregressive transformers.
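The knockout-style analysis mentioned above can be illustrated on a toy residual stream. The following is only a minimal sketch under stated assumptions, not the authors' code: the two-dimensional "figurative/literal" readout, the two hand-written sublayers, the helper names `forward` and `interpretation_shift`, and the convention that a shift is the ablated score minus the clean score are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]
Sublayer = Callable[[Vector], Vector]

# Hypothetical readout: index 0 scores the figurative reading, index 1 the literal one.
FIG, LIT = 0, 1


def forward(x: Vector, sublayers: List[Sublayer], knockout: Optional[int] = None) -> Vector:
    """Residual forward pass; the sublayer at index `knockout` writes nothing."""
    h = list(x)
    for i, sublayer in enumerate(sublayers):
        if i == knockout:
            continue  # zero-ablation: skip this sublayer's update to the residual stream
        update = sublayer(h)
        h = [a + b for a, b in zip(h, update)]
    return h


def interpretation_shift(x: Vector, sublayers: List[Sublayer],
                         knockout: int, readout_index: int) -> float:
    """Assumed definition: score after ablation minus score of the clean run."""
    clean = forward(x, sublayers)
    ablated = forward(x, sublayers, knockout=knockout)
    return ablated[readout_index] - clean[readout_index]


# Toy sublayers (assumptions): the first "retrieves" the figurative reading and
# suppresses the literal one; the second adds a small literal contribution.
toy_sublayers: List[Sublayer] = [
    lambda h: [1.5 * h[LIT], -0.75 * h[LIT]],
    lambda h: [0.0, 0.25],
]

x = [0.0, 1.0]  # input carrying only literal (compositional) content
delta_f = interpretation_shift(x, toy_sublayers, knockout=0, readout_index=FIG)
delta_l = interpretation_shift(x, toy_sublayers, knockout=0, readout_index=LIT)
print(delta_f, delta_l)  # -1.5 0.75
```

In this toy setting, knocking out the first sublayer lowers the figurative score and raises the literal one, which is the qualitative signature the abstract attributes to early sublayers. A real experiment would replace the toy functions with hooks on the attention and MLP outputs of a pretrained model.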

Paper Structure

This paper contains 24 sections, 8 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: The figurative and literal interpretations are highlighted in the blocks and paths. We find three main steps in idiom processing. Idiom retrieval: In the early layers (i.e., layers 0-3), the attention and MLP sublayers actively retrieve the idiom's figurative interpretation while storing both the figurative and literal interpretations in the residual stream. Selective interpretation: At the token immediately following the idiom span, the model begins, from the middle layers onward, to encode a representation that favors the figurative interpretation over the literal one. Interpretation routing: For the final prediction, the model passes the literal interpretation via a direct compositional semantic path, alongside the intermediate pathway that prioritizes the figurative interpretation (figurative path).
  • Figure 2: Sublayer-wise interpretation shift $\Delta I(s)$ after ablating activations at the idiom span, for sentences $s \in \{s_a, s_f, s_l\}$. Y-axis: Mean values of $\Delta L(s_a)$, $\Delta F(s_a)$, $\Delta L(s_l)$, $\Delta F(s_f)$ with 95% confidence intervals. X-axis: Layers. Gray dashed line: $\Delta I = 0$ (no effect). Red asterisk (*): Significant difference between $\Delta F(s_a)$ and the others (paired $t$-test, $p<0.05$). The difference at each *-marked layer is larger than the average difference across all layers.
  • Figure 3: Heatmaps of (a) $\Delta F(s_a)$ and (b) $\Delta L(s_a)$ when ablating individual attention heads at the idiom span. Idiomatic heads: Heads that are crucial for retrieving the figurative interpretation of the idiom, i.e., heads whose ablation yields a negative $\Delta F(s_a)$ and a positive $\Delta L(s_a)$.
  • Figure 4: Kernel alignment between hidden states ($\mathrm{x}$) extracted from four different token positions of $s_a$ and semantic embeddings of paraphrases ($s_f, \, s_l$).
  • Figure 5: Conceptual description of the activation patching experiments for tracing information flow (L = literal interpretation; F = figurative interpretation).
  • ...and 9 more figures
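
The caption of Figure 4 refers to a kernel alignment score between hidden states and paraphrase embeddings but does not say which variant is used. As a rough illustration only, the sketch below computes centered kernel alignment with linear kernels, which is one common choice; that choice, the function names, and the tiny example matrices are assumptions and are not taken from the paper. Rows are paired examples (e.g., one row per sentence) and columns are features.

```python
import math
from typing import List

Matrix = List[List[float]]


def center_columns(m: Matrix) -> Matrix:
    """Subtract each column's mean so the resulting Gram matrix is centered."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def gram(m: Matrix) -> Matrix:
    """Linear-kernel Gram matrix: pairwise dot products between rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


def frobenius_inner(a: Matrix, b: Matrix) -> float:
    """Sum of elementwise products of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Centered kernel alignment with linear kernels.

    Undefined (division by zero) if either input has no variance across rows.
    """
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    return frobenius_inner(k, l) / math.sqrt(frobenius_inner(k, k) * frobenius_inner(l, l))


# Hypothetical stand-ins for hidden states and paraphrase embeddings.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# Same geometry, rotated by 90 degrees and scaled by 3: alignment should be (numerically) 1.
rotated = [[-3.0 * b, 3.0 * a] for a, b in hidden]
# Two one-dimensional patterns that are uncorrelated across examples: alignment is 0.
pattern_a = [[1.0], [-1.0], [1.0], [-1.0]]
pattern_b = [[1.0], [1.0], [-1.0], [-1.0]]

print(linear_cka(hidden, rotated))
print(linear_cka(pattern_a, pattern_b))
```

Because this score is invariant to rotations and uniform rescaling of either representation, it can compare hidden states with sentence embeddings of a different dimensionality, as long as both are computed over the same set of examples.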