Characterizing Pattern Matching and Its Limits on Compositional Task Structures

Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko, Hyeonbin Hwang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo

TL;DR

This work formalizes pattern matching in neural generalization as functional-equivalence-based generalization within a data-driven framework, introducing $k$-equivalence, $k$-coverage, and a substitution graph to define the boundary of pattern-matching capabilities. It shows that instance-wise success correlates with the number of supporting contexts, and proves a tight data-scaling law for a two-hop structure, $N_{\mathrm{req}} = \tilde{\Theta}(n^c)$ with $c=2.5-0.5/k$, where $n$ is the token set size. The result is robust across architectures, up to a 20x parameter increase, and across tasks (2-Hop, 3-Hop, etc.). The study identifies path ambiguity as a structural barrier: when multiple computation paths exist, models fail to form unified intermediate-state representations. Chain-of-Thought reduces data requirements but does not fully resolve this issue. A taxonomy of generalization mechanisms distinguishes functional-equivalence-based pattern matching from property-based and shared-operator generalization, offering a principled diagnostic for when pattern matching can account for generalization, and guiding targeted data augmentation and future research on non-pattern-matching mechanisms.

Abstract

Despite impressive capabilities, LLMs' successes often rely on pattern-matching behaviors, yet these are also linked to OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise and testable account of how well LLMs perform generalization through pattern matching and their limitations. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., identifying pairs of subsequences of inputs that consistently lead to identical results when the rest of the input is held constant. Then, we systematically study how decoder-only Transformer and Mamba behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights: (1) Instance-wise success of pattern matching is well predicted by the number of contexts witnessing the relevant functional equivalence. (2) We prove a tight sample complexity bound of learning a two-hop structure by identifying the exponent of the data scaling law for perfect in-domain generalization. Our empirical results align with the theoretical prediction, under 20x parameter scaling and across architectures. (3) Path ambiguity is a structural barrier: when a variable influences the output via multiple paths, models fail to form unified intermediate state representations, impairing accuracy and interpretability. (4) Chain-of-Thought reduces data requirements yet does not resolve path ambiguity. Hence, we provide a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms.

Paper Structure

This paper contains 55 sections, 27 theorems, 122 equations, 20 figures, 2 tables, 1 algorithm.

Key Result

Theorem 6.1

Consider a 2-Hop task with a token set of size $n$. For a uniformly randomly sampled train dataset $D$ of size $N$, consider a learner that generalizes within the $k$-coverage of $D$. Then, for large enough $n$, the learner achieves perfect ID generalization with high probability if $N \gtrsim n^{c}$ with $c = 2.5 - 0.5/k$.
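
As a quick check of the exponent $c = 2.5 - 0.5/k$ quoted in the TL;DR, evaluating it at a few values of $k$ gives:

$$
c(1) = 2, \qquad c(2) = 2.25, \qquad c(3) = 2.5 - \tfrac{1}{6} \approx 2.33, \qquad \lim_{k \to \infty} c(k) = 2.5 .
$$

Requiring more witnessing contexts (larger $k$) therefore raises the exponent from $2$ toward $2.5$. Assuming each of $x_1, x_2, x_3$ ranges over all $n$ tokens, the full input space has $n^3$ elements, so for every $k$ the required dataset size grows strictly more slowly than the number of possible inputs.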

Figures (20)

  • Figure 1: Illustration of functional equivalence. Left: In a two-hop task $(x_1, x_2, x_3) \mapsto t$ with $t=f_2(f_1(x_1,x_2),x_3)$, two fragments $(x_1,x_2)$ and $(x_1',x_2')$ satisfying $f_1(x_1,x_2)=f_1(x_1',x_2')=b$ consistently yield the same final output when combined with the same context $x_3$, supporting their functional equivalence. Right: Among all possible inputs (a few are shown), we draw an edge between any two inputs that differ only by functionally equivalent fragments, forming a substitution graph. The coverage is then the set of observed inputs (highlighted in blue) together with all inputs connected to them. We define pattern matching as a type of generalization that occurs inside the coverage, harnessing functional equivalence.
  • Figure 2: Four synthetic task structures we study.
  • Figure 3: Left: Percentage of covered ID data depending on $k$ values and dataset size ($N$), for 2-Hop task ($\left|{\mathcal{X}}\right|=50$). Right: Test accuracy depending on $k$-cutoff values for 2-Hop task ($\left|{\mathcal{X}}\right|=50$, $N=10\text{k}$). Each line represents a different training checkpoint. Note that out-of-coverage ($k=0$) accuracy remains at chance level ($\approx 1/50$) regardless of training time. The bars below show the number of test examples for each $k$-cutoff value.
  • Figure 4: Left: Heatmap of Intra-Inter Cosine Gap (IICG) across layers and positions, sliced by $k$-cutoff. Higher IICG values indicate stronger clustering of representations that share the same intermediate state. The positions with the highest IICG values are marked with squares. Right: PCA visualization of latent representations at position $x_2$ and layer 3. Datapoints are classified by their intermediate states $b=f_1(x_1,x_2)$.
  • Figure 5: Left: Log-log plot of measured $\hat{N}_{\mathrm{req}}$ vs. token set size ($\left|{\mathcal{X}}\right|$) across three compositional tasks. The slope $c$ corresponds to the empirical power-law scaling exponent. Omitted points for 3-Hop are due to prohibitively large dataset requirements. Right: Power-law scaling behavior on 2-Hop task across varying GPT-2 model sizes (68M to 1.5B parameters) and a Mamba model. For Mamba, we used 4 layers, a hidden dimension of 256, and a learning rate of 0.008; $\hat{N}_{\mathrm{req}}$ is measured only for $\left|{\mathcal{X}}\right|\le100$, since larger token set sizes led to training instability. $R^2>0.99$ for all linear fits.
  • ...and 15 more figures
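
The slope reported in Figure 5 is the exponent of a power law read off a log-log plot. The sketch below shows one way such a slope could be estimated with ordinary least squares on $(\log n, \log N_{\mathrm{req}})$ pairs. The function names are hypothetical, and the data points are synthetic values generated from $n^{c}$ with $c = 2.5 - 0.5/k$; they are not measurements from the paper.

```python
import math


def predicted_exponent(k):
    """Exponent c = 2.5 - 0.5/k from the two-hop scaling law in the TL;DR."""
    return 2.5 - 0.5 / k


def fit_power_law_exponent(ns, n_reqs):
    """Least-squares fit of log(N_req) = slope * log(n) + intercept.

    Returns (slope, intercept); the slope is the power-law exponent.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in n_reqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic illustration only (assumed token set sizes, k = 2):
# the points lie exactly on N = n^c, so the fit should recover c.
token_set_sizes = [50, 100, 150, 200]
synthetic_n_req = [n ** predicted_exponent(2) for n in token_set_sizes]
fitted_slope, _ = fit_power_law_exponent(token_set_sizes, synthetic_n_req)
print(f"fitted exponent: {fitted_slope:.3f}")
```

With real measurements, comparing the fitted slope against `predicted_exponent(k)` would be the natural sanity check of the scaling law.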

Theorems & Definitions (50)

  • Definition 3.1: Functional $k$-equivalence
  • Definition 3.2: $k$-coverage
  • Theorem 6.1: Informal
  • Proposition F.1
  • Definition F.2: In-domain closure, in terms of $I=\{1,2\}$
  • Definition F.3: Substitution graph, in terms of $I=\{1,2\}$
  • Definition F.4: $k$-coverage, in terms of $I=\{1,2\}$
  • Definition F.5: Evidence graphs
  • Theorem F.7: Sample Complexity Upper Bound, $k\ge 2$
  • Theorem F.8: Sample Complexity Upper Bound, $k=1$
  • ...and 40 more
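
To make Definitions 3.1 and 3.2 more concrete, here is a minimal sketch of how $k$-coverage might be computed for a 2-Hop dataset. It follows the description in the Figure 1 caption rather than the formal definitions, which are not reproduced here, so the details are assumptions: two fragments $(x_1, x_2)$ are treated as equivalent when they share at least $k$ observed contexts $x_3$ and agree on every shared context, equivalence is closed transitively, and an input counts as covered when its fragment's equivalence class has been observed with its context. The function name `k_coverage` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def k_coverage(train, k):
    """Return the set of inputs (x1, x2, x3) covered by `train`.

    `train` maps observed inputs (x1, x2, x3) to their outputs t.
    Simplified reading of k-coverage; see the assumptions in the lead-in.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # fragment (x1, x2) -> {context x3: observed output t}
    by_fragment = defaultdict(dict)
    for (x1, x2, x3), t in train.items():
        by_fragment[(x1, x2)][x3] = t

    # Union-find over fragments to take the transitive closure.
    parent = {frag: frag for frag in by_fragment}

    def find(frag):
        while parent[frag] != frag:
            parent[frag] = parent[parent[frag]]  # path halving
            frag = parent[frag]
        return frag

    for f, g in combinations(list(by_fragment), 2):
        shared = by_fragment[f].keys() & by_fragment[g].keys()
        agree = sum(by_fragment[f][c] == by_fragment[g][c] for c in shared)
        # Need >= k witnessing contexts and no contradicting context.
        if agree >= k and agree == len(shared):
            parent[find(f)] = find(g)

    # Contexts observed with any member of each equivalence class.
    class_contexts = defaultdict(set)
    for frag, outputs in by_fragment.items():
        class_contexts[find(frag)].update(outputs)

    return {
        (frag[0], frag[1], ctx)
        for frag in by_fragment
        for ctx in class_contexts[find(frag)]
    }


# Toy task (assumed): f1 = (x1 + x2) % 3 and f2 = (b + x3) % 3.
toy_train = {
    (0, 1, 0): 1, (0, 1, 1): 2,                # fragment (0, 1), b = 1
    (1, 0, 0): 1, (1, 0, 1): 2, (1, 0, 2): 0,  # fragment (1, 0), b = 1
    (0, 0, 0): 0,                              # fragment (0, 0), b = 0
}
newly_covered = k_coverage(toy_train, k=2) - set(toy_train)
print(sorted(newly_covered))
```

In this toy example, the fragments $(0,1)$ and $(1,0)$ agree on two shared contexts, so with $k=2$ the unseen input $(0,1,2)$ becomes covered by substitution; with $k=3$ the evidence is insufficient and the coverage reduces to the training set itself.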