Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

Andrew Gomes

TL;DR

This work probes how transformer models process idioms by employing a modified path patching approach to discover minimal computational circuits underlying idiom comprehension. It reveals a two-phase pattern: early non-compositional processing with cross-token interactions in layers $0$--$2$ and later semantic integration on the final idiom token in layers $3$--$5$, supported by specialized Idiom Heads and an augmented reception mechanism that links early and late attention. A custom single-corruption path-patching workflow with thresholding and circuit merging identifies idiom-specific circuits and reveals that idiom representations reside in distinct QK directions per idiom rather than a universal axis. These findings advance mechanistic interpretability of non-compositional language in transformers and provide a framework for studying more complex grammatical phenomena.

Abstract

We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate ``Idiom Heads,'' attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term ``augmented reception.'' We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Cosine similarity analysis for the idiom a piece of cake and its corrupted variants, using the meaning string $M = \textit{That was easy}$. For each string $S = \textit{That was a ...}$, the cosine similarity between the final-token embedding vectors $\vec{x}_S^\ell$ and $\vec{x}_M^\ell$ is computed across all layers $\ell$. By layer 4, the idiom string is significantly more aligned with the meaning string than are the corrupted strings, demonstrating the effectiveness of the corruptions for isolating network components involved in its processing. This plot is representative of all idioms analyzed in this work (Table \ref{tab:attention_effects}).
  • Figure 2: A single-corruption ($\textit{piece}\to\textit{chunk}$) circuit discovered for the idiom a piece of cake using $I = \textit{That was a piece of cake}$, $M = \textit{That was easy}$, $L = 4$, and $\tau = 0.005$. The yellow triangles represent embedding vectors, the green circles are the post-MLP residual stream, and the orange squares are attention heads (numbered 0--7 in Gemma 2 2B). Edges are colored red for performance drops and blue for gains (antagonistic components), with thickness proportional to the magnitude of the effect. Residual-to-residual edges are shown for completeness but are never patched. Cross-token edges are labeled K for Key, V for Value, and KV for both. The unambiguous Query edges are not labeled.
  • Figure 3: Threshold sweep for the $\textit{cake}\to\textit{cupcake}$ corruption of a piece of cake using $I = \textit{That was a piece of cake}$, $C = \textit{That was a piece of cupcake}$, $M = \textit{That was easy}$, and $L = 4$. The final-token cosine similarity between $C$ and $M$ (blue) and the logarithm of the number of edges (red) in the discovered circuit $H_C^\tau$ are plotted against the threshold $\tau$. In this case we would choose $\tau_* = 0.007$.
  • Figure 4: The pruned circuit $\tilde{H}_I$ of the idiom a piece of cake using $I = \textit{That was a piece of cake}$, $M = \textit{That was easy}$, and $L = 4$. The corruptions (thresholds) used are $\textit{piece}\to\textit{chunk/slice}$ (both $0.005$) and $\textit{cake}\to\textit{cupcake/pie}$ (both $0.007$). Notice how cross-token processing takes place in layers 0--2, before the idiom's figurative meaning is resolved in layers 3--4 (Figure \ref{fig:cosine-sim}).
  • Figure 5: Circuits for the idiom kicked the bucket using $I = \textit{He kicked the bucket}$, $M = \textit{He died}$, and $L = 4$. The corruptions (thresholds) used are $\textit{the}\to\textit{a}$ ($0.010$) and $\textit{the}\to\textit{this}$ ($0.009$). Notice how attention heads $(2,0)$ and $(1,5)$, respectively, on the final token bucket have incoming Query edges, demonstrating augmented reception.
  • ...and 1 more figures
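
The layer-wise comparison described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the final-token vectors $\vec{x}_S^\ell$ and $\vec{x}_M^\ell$ have already been extracted into arrays of shape (layers, model dimension), and the function name `layerwise_cosine` and the random toy data are hypothetical stand-ins for real model activations.

```python
import numpy as np


def layerwise_cosine(x_s: np.ndarray, x_m: np.ndarray) -> np.ndarray:
    """Cosine similarity at each layer between two stacks of final-token vectors.

    x_s, x_m: arrays of shape (num_layers, d_model), one row per layer.
    Returns an array of shape (num_layers,).
    """
    dots = np.sum(x_s * x_m, axis=-1)
    norms = np.linalg.norm(x_s, axis=-1) * np.linalg.norm(x_m, axis=-1)
    return dots / norms


# Toy data only (assumption): random vectors standing in for activations
# of a meaning string M and a noisy variant S.
rng = np.random.default_rng(0)
num_layers, d_model = 5, 16
x_m = rng.normal(size=(num_layers, d_model))
x_s = x_m + 0.5 * rng.normal(size=(num_layers, d_model))

sims = layerwise_cosine(x_s, x_m)  # one similarity value per layer
```

With real activations, one such curve would be computed for the idiom string and for each corrupted string, and the curves compared layer by layer.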