How Do Language Models Compose Functions?

Apoorv Khandelwal, Ellie Pavlick

TL;DR

The paper investigates whether large language models solve two-hop compositional tasks via explicit compositional mechanisms or idiomatic shortcuts, framing tasks as $y = g(f(x))$ and examining intermediate representations. Using logit lens analyses of residual streams and embedding-space linearity tests, it uncovers two processing modes: compositional (with a detectable $f(x)$ intermediate) and direct, or idiomatic (with no detectable signature of the intermediate). Which mode is used is related in part to whether the relation between $x$ and $g(f(x))$ is linear in embedding space. The key findings show a persistent compositionality gap across modern models, even as model size and reasoning capabilities reduce the gap for some tasks; the idiomatic mode dominates where a linear mapping from $x$ to $g(f(x))$ exists. The work highlights the nuanced relationship between representation geometry and processing strategies in LLMs, offering interpretability-based insights into when compositional reasoning emerges and suggesting causal interventions as a route to further understanding. These results have implications for theories of compositionality and generalization, indicating that effective compositional behavior can arise from non-symbolic mechanisms shaped by pretraining rather than explicit symbolic architectures.
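The idea of an embedding-space linearity test can be made concrete with a small sketch. The summary does not say how the paper measures linearity, so everything here is an assumption: the helper names (`solve`, `fit_linear_map`, `predict`), the least-squares fit, and the toy two-dimensional "embeddings" are hypothetical. The sketch fits a matrix $W$ that maps embeddings of $x$ to embeddings of $g(f(x))$ and checks the fit on a held-out example.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ w = b by Gauss-Jordan elimination with partial pivoting.

    Assumes `a` is square and invertible (this sketch does no singularity check).
    """
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n:] for row in m]

def fit_linear_map(X, Y):
    """Least-squares W minimizing ||X W - Y||^2 via the normal equations."""
    Xt = transpose(X)
    return solve(matmul(Xt, X), matmul(Xt, Y))

def predict(x, W):
    return matmul([x], W)[0]

# Toy data (hypothetical): rows of X stand in for embeddings of x,
# rows of Y for embeddings of g(f(x)). Here Y is exactly linear in X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[2.0, 0.0], [1.0, 1.0], [3.0, 1.0]]
W = fit_linear_map(X, Y)

# Held-out check: a small error suggests the relation is close to linear.
held_out_x, held_out_y = [2.0, 3.0], [7.0, 3.0]
pred = predict(held_out_x, W)
error = sum((p - t) ** 2 for p, t in zip(pred, held_out_y))
```

With real model embeddings the fit would not be exact, and the held-out error (or a retrieval accuracy over candidate outputs) would serve as a per-task linearity score.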

Abstract

While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap": i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. Then, using logit lens on their residual stream activations, we identify two processing mechanisms, one which solves tasks $\textit{compositionally}$, computing $f(x)$ along the way to computing $g(f(x))$, and one which solves them $\textit{directly}$, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that which mechanism is employed appears to be related to the embedding space geometry, with the idiomatic mechanism being dominant in cases where there exists a linear mapping from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions .
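As a minimal sketch of the "compositionality gap" described above, the following computes the proportion of examples where the model solves both hops but fails the composition. The `Example` record and its field names are hypothetical, and the toy correctness flags are made up for illustration; they are not results from the paper.

```python
from dataclasses import dataclass

@dataclass
class Example:
    hop1_correct: bool  # model computes z = f(x)
    hop2_correct: bool  # model computes y = g(z)
    comp_correct: bool  # model computes y = g(f(x))

def compositionality_gap(examples):
    """Share of examples with both hops correct where the composition is wrong."""
    both_hops = [e for e in examples if e.hop1_correct and e.hop2_correct]
    if not both_hops:
        return float("nan")  # gap is undefined if no example has both hops correct
    failed = sum(not e.comp_correct for e in both_hops)
    return failed / len(both_hops)

# Toy data: three examples have both hops correct, two of them fail the composition.
examples = [
    Example(True, True, True),
    Example(True, True, False),
    Example(True, False, False),
    Example(True, True, False),
]
gap = compositionality_gap(examples)
```

The denominator counts only examples where both hops succeed, so the gap isolates failures of composition from failures of recall.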

Paper Structure

This paper contains 35 sections, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Compositionality gap for Llama 3 (3B) on our tasks. Red bar represents examples for which the model is able to solve all hops, out of all examples (absolute). Blue and yellow bars are relative to the red bar: they show proportions of examples out of those in the red bar. Blue represents the same examples as red and yellow represents those for which the model is able to additionally solve the composition. Correlation between red and yellow bars is $r^2 = 0.00$.
  • Figure 2: Compositionality gap (dashed purple line; lower is better) of various models aggregated over 4 tasks (100 examples each). Blue, yellow, and red lines show proportions of examples for which models correctly solve combinations of hops and the composition. Purple line shows the relative gap between yellow and red: the proportion of examples for which the model cannot solve the composition, out of those for which it can solve all hops. "-I" indicates the instruction-tuned variant of Llama 3 (405B). Error bands show interquartile range.
  • Figure 3: (a–b) Processing signatures aggregated over examples (across all tasks) in which Llama 3 (3B) solves all hops correctly, but the composition (a) correctly or (b) incorrectly. (c–f) Processing signatures for particular tasks, aggregated over examples where the model correctly solves all hops and the composition. (a–f) Lines show reciprocal ranks of relevant variables (decoded using logit lens) from residual streams corresponding to $x \to g(f(x))$. Intermediate variables are shown with dashed lines. The incorrect composition, $f(g(x))$, is shown by the red line when distinct from $g(f(x))$.
  • Figure 4: (a) Strong correlation across tasks between presence of intermediate variables (a reciprocal-rank heuristic from the logit lens experiments, averaged across examples) and embedding space linearity ($r^2 = 0.53$). Conversely, accuracy is weakly correlated with the intermediate-variable ($r^2 = 0.22$) and linearity ($r^2 = 0.13$) metrics. (b) Distribution of examples for each task, shown as a histogram of intermediate variable reciprocal ranks. (a–b) Colors refer to corresponding tasks between points in (a) and histograms in (b).
  • Figure 5: We illustrate the monotonically diminishing improvements to the compositionality gap resulting from increased model size (layers and parameters). We re-visualize results for the OLMo 2 and Llama 3 model families from Figure 2.
  • ...and 10 more figures
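
The reciprocal ranks mentioned in the captions for Figures 3 and 4 can be illustrated with a small sketch. It assumes a simplified logit lens: a residual-stream vector is multiplied directly by an unembedding matrix, and the rank of a target token among the resulting logits is inverted. In practice the lens is usually applied after the model's final normalization layer, which is omitted here. The function name, the toy unembedding matrix, and the per-layer vectors are all hypothetical.

```python
def logit_lens_reciprocal_rank(hidden, unembed, target_id):
    """Reciprocal rank of `target_id` when `hidden` is decoded with `unembed`.

    hidden:  residual-stream vector (list of floats, length d)
    unembed: one row of length d per vocabulary token
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    # Rank 1 means the target token has the highest logit.
    rank = 1 + sum(logit > logits[target_id] for logit in logits)
    return 1.0 / rank

# Toy setup: vocabulary of 4 tokens, hidden size 2.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
# Residual stream at three successive layers for a single example.
layers = [[0.1, 0.9], [0.5, 0.6], [0.9, 0.2]]

# Suppose token 1 stands for the intermediate f(x) and token 0 for the output g(f(x)).
rr_intermediate = [logit_lens_reciprocal_rank(h, unembed, 1) for h in layers]
rr_output = [logit_lens_reciprocal_rank(h, unembed, 0) for h in layers]
```

Plotting such per-layer reciprocal ranks for $f(x)$ and $g(f(x))$ gives curves of the kind the Figure 3 caption describes: a compositional signature would show the intermediate ranking highly in some layers before the output does, while a direct signature would not.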