When does compositional structure yield compositional generalization? A kernel theory

Samuel Lippl, Kim Stachenfeld

TL;DR

The paper develops a kernel-theoretic account of when compositional structure yields compositional generalization. It shows that compositionally structured kernel models implement conjunction-wise additivity, constraining generalization to sums over observed conjunctions and excluding non-additive relations like transitive equivalence. The authors link generalization to representational geometry via overlap salience and identify memorization leak and shortcut bias as key data-driven failure modes. Empirically, the theory captures qualitative behavior of deep networks (CNNs, ResNets, ViTs) on a range of compositional tasks, offering a framework for diagnosing and improving generalization in structured settings.

Abstract

Compositional generalization (the ability to respond correctly to novel combinations of familiar components) is thought to be a cornerstone of intelligent behavior. Compositionally structured (e.g. disentangled) representations support this ability; however, the conditions under which they are sufficient for the emergence of compositional generalization remain unclear. To address this gap, we present a theory of compositional generalization in kernel models with fixed, compositionally structured representations. This provides a tractable framework for characterizing the impact of training data statistics on generalization. We find that these models are limited to functions that assign values to each combination of components seen during training, and then sum up these values ("conjunction-wise additivity"). This imposes fundamental restrictions on the set of tasks compositionally structured kernel models can learn, in particular preventing them from transitively generalizing equivalence relations. Even for compositional tasks that they can learn in principle, we identify novel failure modes in compositional generalization (memorization leak and shortcut bias) that arise from biases in the training data. Finally, we empirically validate our theory, showing that it captures the behavior of deep neural networks (convolutional networks, residual networks, and Vision Transformers) trained on a set of compositional tasks with similarly structured data. Ultimately, this work examines how statistical structure in the training data can affect compositional generalization, with implications for how to identify and remedy failure modes in deep learning models.
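One way to write down the "conjunction-wise additivity" described in the abstract is sketched below. The notation ($C$ components, subsets $S$, per-conjunction values $g_S$) is an assumption made for illustration and is not taken from the paper.

```latex
% Hypothetical notation: x = (x_1, ..., x_C) is an input with C components,
% S ranges over subsets of components, and x_S is the conjunction of the
% components in S. Each g_S assigns a value to a conjunction.
f(x) \;=\; \sum_{S \subseteq \{1,\dots,C\}} g_S(x_S),
\qquad
g_S(x_S) = 0 \ \text{ if the conjunction } x_S \text{ never appeared in training.}

% Two-component example: for a novel pair (a, b), the pairwise term vanishes,
% so the prediction reduces to a sum over the single components.
f(a,b) \;=\; g_{\emptyset} + g_{1}(a) + g_{2}(b)
\quad \text{for } (a,b) \text{ not seen in training.}
```

Read this way, a task that requires a non-zero value for an unseen conjunction falls outside what such a model can represent.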

Paper Structure

This paper contains 62 sections, 7 theorems, 53 equations, and 19 figures.

Key Result

Proposition 4.1

For a random-weight neural network $\phi$ with a compositionally structured input $x$, the output $\phi(x)$ is also compositionally structured in the infinite-width limit.
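The proposition can be checked numerically in one special case. The sketch below is a minimal illustration, not the paper's construction: it assumes concatenated one-hot inputs, ReLU nonlinearities, standard normal weights, and no biases, so that each infinite-width layer maps the kernel through the closed form $\mathbb{E}[\mathrm{relu}(w \cdot x)\,\mathrm{relu}(w \cdot y)]$ (the degree-1 arc-cosine kernel up to scaling). "Compositionally structured" is taken to mean that pairs of inputs with the same number of overlapping components have identical similarity. All function names are hypothetical.

```python
import itertools
import numpy as np

def one_hot_concat(components, n_values):
    """Encode each categorical component as a one-hot block and concatenate."""
    x = np.zeros(len(components) * n_values)
    for slot, value in enumerate(components):
        x[slot * n_values + value] = 1.0
    return x

def relu_infinite_width_layer(K):
    """Map a kernel through one infinite-width ReLU layer.

    Assumes weights w ~ N(0, I) and no bias, so the output kernel is
    E[relu(w.x) relu(w.y)], which has a closed form in the angle theta.
    """
    norms = np.sqrt(np.diag(K))
    scale = np.outer(norms, norms)
    cos = np.clip(K / scale, -1.0, 1.0)
    theta = np.arccos(cos)
    return scale * (np.sin(theta) + (np.pi - theta) * cos) / (2.0 * np.pi)

def is_compositionally_structured(K, overlaps, tol=1e-10):
    """True if all pairs with the same overlap count share one kernel value."""
    for m in np.unique(overlaps):
        vals = K[overlaps == m]
        if vals.max() - vals.min() > tol:
            return False
    return True

# All inputs with 3 components, each taking one of 4 values (toy sizes).
n_components, n_values = 3, 4
inputs = list(itertools.product(range(n_values), repeat=n_components))
X = np.stack([one_hot_concat(c, n_values) for c in inputs])

# For one-hot inputs the inner product equals the number of shared components.
K_in = X @ X.T
overlaps = np.rint(K_in).astype(int)

# Pass the kernel through three infinite-width ReLU layers.
K_out = K_in
for _ in range(3):
    K_out = relu_infinite_width_layer(K_out)

print(is_compositionally_structured(K_in, overlaps))
print(is_compositionally_structured(K_out, overlaps))
```

The check succeeds here because all inputs have the same norm and the layer map depends only on norms and inner products; other nonlinearities would need their own kernel map.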

Figures (19)

  • Figure 1: Overview of main theoretical findings (\ref{sec:main-theorem}). a, We consider inputs with several categorical components and study compositional generalization to novel combinations of components. b, We assume compositionally structured representations for which trials with the same number of overlaps have identical similarities. c, We find that a random-weight neural network conserves compositional structure: if its input is compositionally structured, then so is its output. d, We find that compositionally structured kernel models are constrained to adding up values for each conjunction seen during training ("conjunction-wise additivity").
  • Figure 2: The compositional task space. a, Example training sets for symbolic addition. The grid represents pairs of components with associated values $-4,\dotsc,4$. The training set consists of certain rows and columns of this grid. b, Context dependence. In context 1, feature 1 determines the category; in context 2, feature 2 determines the category. The training sets leave out subsets of the lower right orthant. c, Transitive equivalence: six items are split into two equivalence classes (e.g. A,B,C and D,E,F), and generalization requires transitive inference over equivalence classes (e.g. $A=B$ and $B=C$ implies $A=C$). d, Our theory partitions the task space into conjunction-wise additive tasks (which can be solved by kernel models) and non-additive tasks (which cannot).
  • Figure 3: Representational salience (for three input components) in a random-weight neural network with variable numbers of layers and nonlinearities.
  • Figure 4: Kernel models' behavior on (a,b) symbolic addition and (c,d) context dependence. a, Model predictions on training and test set plotted against ground truth for an example case ($S(1;2)=0.4$, $\mathcal{W}=\{0\}$). b, The slope of the test set as a function of $S(1;2)$ and the training set size $p=|\mathcal{W}|$. c, Generalization on context dependence as a function of representational salience. Trajectories of networks with different nonlinearities are highlighted (for the color scale, see \ref{fig:salience}). d, Coefficients of the different conjunction types for two example networks with three layers and different nonlinearities.
  • Figure 5: Testing our theory in deep networks trained on MNIST and CIFAR versions of compositional tasks. Ranges indicate mean $\pm$ one standard error (often too small to be visible). a, $S(1;2)$ in an intermediate ConvNet layer for different distances between digits. Lower distance yields a more conjunctive representation. b, Average model prediction for each combination of components plotted against the ground truth (MNIST, distance of zero, $\mathcal{W}=\{0\}$). Generalization on the compositional split is distorted by a proportional factor. c, d, Slope of this linear relationship across all datasets as a function of c, distance (each line corresponding to a particular $\mathcal{W}$) and d, training set (for a distance of zero). For MSE instead of slope, see \ref{fig:supp-mse}. e, Accuracy on all variants of context dependence.
  • ...and 14 more figures
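The symbolic addition setup from the captions of Figures 2a and 4a can be sketched with a toy kernel regression. This is a minimal illustration under stated assumptions, not the paper's model: the kernel value depends only on how many components two inputs share, with hypothetical values 0, `s`, and 1 for zero, one, and two shared components. The parameter `s` is an illustrative stand-in and is not claimed to equal the paper's salience $S(1;2)$. Training uses the row and column where one component has value 0, in the spirit of $\mathcal{W}=\{0\}$.

```python
import numpy as np

values = np.arange(-4, 5)   # component values -4, ..., 4
s = 0.4                     # hypothetical similarity for one shared component

def kernel(x, y):
    """Kernel that depends only on the number of shared components."""
    overlap = int(x[0] == y[0]) + int(x[1] == y[1])
    return {0: 0.0, 1: s, 2: 1.0}[overlap]

# Train on pairs where at least one component is 0; test on all other pairs.
train = [(a, b) for a in values for b in values if a == 0 or b == 0]
test = [(a, b) for a in values for b in values if a != 0 and b != 0]
y_train = np.array([a + b for a, b in train], dtype=float)

# Interpolating kernel regression: solve K alpha = y on the training set.
K = np.array([[kernel(x, y) for y in train] for x in train])
alpha = np.linalg.solve(K, y_train)

def predict(x):
    """Representer-form prediction f(x) = sum_i alpha_i k(x, x_i)."""
    return sum(w * kernel(x, z) for w, z in zip(alpha, train))

pred_test = np.array([predict(x) for x in test])
truth_test = np.array([a + b for a, b in test], dtype=float)

# Slope of test predictions against ground truth (1.0 would be perfect).
slope = np.polyfit(truth_test, pred_test, 1)[0]
print(f"test slope: {slope:.4f}")
```

In this toy setup every test pair shares at most one component with any training pair, so the test predictions are a sum of one term per component. Working through the algebra for this particular kernel gives predictions proportional to the true sum with slope `s / (1 - s)`, which is 2/3 for `s = 0.4`. That is one concrete instance of generalization being distorted by a proportional factor.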

Theorems & Definitions (14)

  • Definition 3.1
  • Proposition 4.1
  • Theorem 4.1
  • Proposition 5.1
  • Definition A.1
  • Theorem A.1
  • proof
  • Proposition A.2
  • proof
  • Definition B.1
  • ...and 4 more