Tug-of-war between idioms' figurative and literal interpretations in LLMs
Soyoung Oh, Xinting Huang, Mathis Pink, Michael Hahn, Vera Demberg
TL;DR
This work investigates how autoregressive transformers process idioms that admit both figurative and literal readings. Using causal tracing, knockout, and activation patching on models such as Llama3, the authors identify three mechanisms: early layers retrieve the figurative interpretation; preceding context, when present, biases the interpretation from the earliest layer; and the two interpretations are then routed along separate paths, an intermediate path that favors the figurative reading and a bypass path that favors the literal one. They pinpoint idiom-specific attention heads and early MLPs that causally promote figurative readings, and show that contextual disambiguation proceeds through multi-head self-attention (MHSA) in the middle layers, so that the final prediction aligns with the context while both readings stay available for robust predictions. The findings give a detailed mechanistic account of idiom processing in transformers, with implications for figurative language understanding and interpretability in large language models.
Abstract
Idioms present a unique challenge for language models due to their non-compositional figurative interpretations, which often strongly diverge from the idiom's literal interpretation. In this paper, we employ causal tracing to systematically analyze how pretrained causal transformers deal with this ambiguity. We localize three mechanisms: (i) Early sublayers and specific attention heads retrieve an idiom's figurative interpretation, while suppressing its literal interpretation. (ii) When disambiguating context precedes the idiom, the model leverages it from the earliest layer and later layers refine the interpretation if the context conflicts with the retrieved interpretation. (iii) Then, selective, competing pathways carry both interpretations: an intermediate pathway prioritizes the figurative interpretation and a parallel direct route favors the literal interpretation, ensuring that both readings remain available. Our findings provide mechanistic evidence for idiom comprehension in autoregressive transformers.
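As a rough illustration of the kind of intervention that causal tracing relies on, the sketch below applies activation patching to a toy residual network. It is not the authors' setup: the weights, the readout vector, the two input vectors, and the function names (`sublayer`, `run`, `patching_effects`) are all hypothetical, and the "clean" and "corrupted" inputs merely stand in for a prompt that elicits one reading and a prompt that elicits the other. The idea is to cache each sublayer's output on the clean run, splice one cached output at a time into the corrupted run, and record what fraction of the clean-minus-corrupted score gap that single splice restores.

```python
import math

# Hypothetical toy weights: three sublayers writing into a 2-d residual stream.
WEIGHTS = [
    [[1.5, 0.0], [0.0, 0.1]],
    [[0.2, 0.0], [0.3, 0.1]],
    [[0.1, 0.2], [0.0, 0.1]],
]
# Hypothetical readout: a higher score stands in for the "figurative" reading.
READOUT = [1.0, -1.0]


def sublayer(weight, x):
    """Output of one sublayer: tanh(W x), which is added to the residual stream."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weight]


def run(x, patch=None):
    """Forward pass. patch=(index, cached_output) overrides that sublayer's output."""
    x = list(x)
    cache = []
    for i, weight in enumerate(WEIGHTS):
        out = sublayer(weight, x)
        if patch is not None and patch[0] == i:
            out = list(patch[1])
        cache.append(out)
        x = [a + b for a, b in zip(x, out)]
    score = sum(r * v for r, v in zip(READOUT, x))
    return score, cache


def patching_effects(clean_x, corrupt_x):
    """Fraction of the clean-vs-corrupted score gap restored by each sublayer patch."""
    clean_score, clean_cache = run(clean_x)
    corrupt_score, _ = run(corrupt_x)
    gap = clean_score - corrupt_score
    effects = []
    for i in range(len(WEIGHTS)):
        patched_score, _ = run(corrupt_x, patch=(i, clean_cache[i]))
        effects.append((patched_score - corrupt_score) / gap)
    return effects


if __name__ == "__main__":
    # Assumed stand-ins for a "figurative" (clean) and a "literal" (corrupted) input.
    for i, effect in enumerate(patching_effects([1.0, 0.0], [0.0, 1.0])):
        print(f"sublayer {i}: restored fraction = {effect:.3f}")
```

In a real transformer the same loop would run over (layer, token position, component) triples, with hidden states read and overwritten through the model's own hooks; a sublayer whose patch restores a large fraction of the gap is a candidate site for the behavior under study.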
