Anatomy of an Idiom: Tracing Non-Compositionality in Language Models
Andrew Gomes
TL;DR
This work probes how transformer models process idioms, using a modified path patching approach to discover minimal computational circuits underlying idiom comprehension. It reveals a two-phase pattern: early non-compositional processing with cross-token interactions in layers $0$--$2$, followed by semantic integration on the final idiom token in layers $3$--$5$. This pattern is supported by specialized Idiom Heads and by an augmented reception mechanism that links early and late attention. A custom single-corruption path-patching workflow with thresholding and circuit merging identifies idiom-specific circuits and shows that idiom representations reside in distinct QK directions per idiom rather than along a universal axis. These findings advance the mechanistic interpretability of non-compositional language in transformers and provide a framework for studying more complex grammatical phenomena.
Abstract
We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. We first discover circuits via a modified path patching algorithm and find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Together, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
