Swing-by Dynamics in Concept Learning and Compositional Generalization

Yongyi Yang, Core Francisco Park, Ekdeep Singh Lubana, Maya Okawa, Wei Hu, Hidenori Tanaka

TL;DR

This work addresses how diffusion models learn compositional concepts and generalize to unseen, out-of-distribution compositions. It introduces Structured Identity Mapping (SIM) as a tractable abstraction for studying concept-learning dynamics, formalized as an identity-regression task on Gaussian-cluster data. Through theoretical analysis of one-layer and symmetric two-layer linear models, it explains the order in which concepts are learned, a terminal slow-down, and a novel Swing-by Dynamics, connecting the stage-wise evolution of the Jacobian to these observed phenomena. The authors validate key predictions by training diffusion models on concept-space tasks, showing non-monotonic loss trajectories and exponential deceleration of concept-space learning, thereby bridging theory and practice in compositional generalization. Overall, SIM provides a mechanistic lens on how modern generative models acquire and manipulate concepts, with implications for designing models that generalize compositionally to novel data.

Abstract

Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence a model's speed of learning the ability to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work's compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the identity mapping on a Gaussian mixture with structurally organized centroids. We mathematically analyze the learning dynamics of neural networks trained on this SIM task and show that, despite its simplicity, SIM's learning dynamics capture and help explain key empirical observations on compositional generalization with diffusion models identified in prior work. Our theory also offers several new insights -- e.g., we find a novel mechanism for non-monotonic learning dynamics of test loss in early phases of training. We validate our new predictions by training a text-conditioned diffusion model, bridging our simplified framework and complex generative models. Overall, this work establishes the SIM task as a meaningful theoretical abstraction of concept learning dynamics in modern generative models.
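
The SIM setup described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `make_sim_data` and `identity_loss` are hypothetical, and the placement of one centroid per concept axis (at $\mu_p$ times the $p$-th basis vector) with the test point at the sum of all concept offsets is an assumed instance of "structurally organized centroids", chosen for simplicity.

```python
import numpy as np

def make_sim_data(mu, sigma, d, n_per_cluster, rng):
    """Sample a toy SIM training set (assumed layout, see lead-in).

    mu:    length-s array, distance of each concept's centroid along its axis
    sigma: length-d array, per-coordinate standard deviation of every cluster
    """
    s = len(mu)
    clusters = []
    for p in range(s):
        center = np.zeros(d)
        center[p] = mu[p]  # centroid sits on the p-th coordinate axis
        noise = rng.standard_normal((n_per_cluster, d)) * sigma
        clusters.append(center + noise)
    x_train = np.concatenate(clusters, axis=0)

    # Out-of-distribution test point: all concepts "switched on" at once.
    x_test = np.zeros(d)
    x_test[:s] = mu
    return x_train, x_test

def identity_loss(W, x):
    """Mean squared error of a linear map x -> W x against the identity target."""
    residual = x @ W.T - x
    return 0.5 * np.mean(np.sum(residual ** 2, axis=1))

rng = np.random.default_rng(0)
x_train, x_test = make_sim_data(
    mu=np.array([1.0, 2.0]), sigma=np.full(4, 0.05), d=4,
    n_per_cluster=200, rng=rng,
)
```

The training target is the input itself, so a model that fits the clusters perfectly is only informative when it is probed at `x_test`, which lies outside every training cluster.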

Paper Structure

This paper contains 53 sections, 11 theorems, 52 equations, 19 figures.

Key Result

Theorem 4.1

Let ${\boldsymbol{W}}(t) \in \mathbb{R}^{d \times d}$ be initialized as ${\boldsymbol{W}}(0) = {\boldsymbol{W}}^{(0)}$ and updated by $\dot{\boldsymbol{W}} = -\nabla \mathcal{L}({\boldsymbol{W}})$, with $\mathcal{L}$ defined by eq:transformed-loss with $f({\boldsymbol{W}},{\boldsymbol{z}}) = {\boldsymbol{W}}{
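
The theorem statement is cut off in this excerpt, so the sketch below does not reproduce it. It only illustrates gradient flow for a one-layer linear model under an assumed loss, $\mathcal{L}({\boldsymbol{W}}) = \tfrac12 \mathbb{E}\|{\boldsymbol{W}}{\boldsymbol{z}} - {\boldsymbol{z}}\|^2$ with $\mathbb{E}[{\boldsymbol{z}}{\boldsymbol{z}}^\top] = \mathrm{diag}({\boldsymbol{a}})$. The loss form, the vector `a`, and the function names are assumptions made for illustration and may differ from eq:transformed-loss. Under this assumption the gradient is $({\boldsymbol{W}} - {\boldsymbol{I}})\,\mathrm{diag}({\boldsymbol{a}})$, so each column of ${\boldsymbol{W}} - {\boldsymbol{I}}$ decays exponentially at its own rate $a_j$.

```python
import numpy as np

def gradient_flow_linear(W0, a, step, n_steps):
    """Euler discretization of dW/dt = -(W - I) diag(a).

    This is the gradient flow of the ASSUMED loss
    L(W) = 0.5 * E||W z - z||^2 with E[z z^T] = diag(a).
    """
    eye = np.eye(W0.shape[0])
    W = W0.copy()
    for _ in range(n_steps):
        # Broadcasting multiplies column j of (W - I) by a[j],
        # which equals (W - I) @ np.diag(a).
        W = W - step * (W - eye) * a
    return W

def closed_form(W0, a, t):
    """Exact solution of the same flow: W(t) = I + (W0 - I) * exp(-a t), column-wise."""
    eye = np.eye(W0.shape[0])
    return eye + (W0 - eye) * np.exp(-a * t)

a = np.array([4.0, 1.0, 0.25])   # hypothetical per-direction second moments
W0 = np.zeros((3, 3))            # start from the zero map
W_euler = gradient_flow_linear(W0, a, step=1e-3, n_steps=1000)  # total time t = 1
W_exact = closed_form(W0, a, t=1.0)
```

In this toy setting the diagonal entry for the direction with the largest `a[j]` approaches 1 fastest, which is one simple way a data-dependent learning order can arise in a linear model.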

Figures (19)

  • Figure 1: Structured Identity Mapping Task and Swing-by Generalization Dynamics. (a) Given the input "blue square apples on a tree with circular yellow leaves," a multimodal model learns to generate concepts in the following order: "apple," "blue" (color), and "square" (shape) (example adapted from li2024scalability). (b) A multimodal synthetic task introduced by okawa2024compositional and park2024emergencehiddencapabilitiesexploring. The training set of the task consists of four distinct compositions of concepts, depicted as blue nodes on a cubic graph. A diffusion model is trained on this dataset to systematically study the dynamics of concept learning. With the test prompt "small, blue, triangle," the diffusion model sequentially learns the correct size, shape, and finally color. (c) In this work, we introduce a structured identity mapping task as the foundation for a systematic, theoretical study of the dynamics of concept learning. The model is trained on Gaussian mixture data whose centroids are positioned at certain nodes of a hyperrectangle (blue dots), and it is evaluated on an out-of-distribution test set (red dot). Our theoretical results not only reproduce and explain previously characterized empirical phenomena but also give a comprehensive picture of the non-monotonic learning dynamics in the concept space and predict a "multiple-descent" curve of the test loss (red curve).
  • Figure 2: Learning dynamics of an MLP on the SIM task. The figures show the output trajectory of the MLP in a two-dimensional setting (i.e., $s = 2$); each marker represents an optimization timepoint. Note that only the center of the training set is plotted, as a circle; the actual training set can have varied shapes depending on the configuration of ${\boldsymbol{\sigma}}$. (a) One-layer linear model with ${\boldsymbol{\sigma}}_{:2} = (.05,.05)$ and varied ${\boldsymbol{\mu}}$. Concepts $i$ with larger signal ($\mu_i$) are learned first. (b) One-layer linear model with ${\boldsymbol{\mu}}_{:2} = (1,2)$ and varied ${\boldsymbol{\sigma}}$. Concepts $i$ with larger diversity ($\sigma_i$) are learned first. (c) $4$-layer linear models with ${\boldsymbol{\mu}}_{:2} = (1, 2)$ and ${\boldsymbol{\sigma}}_{:2} = (.05,.05)$ at different dimensionalities (high dim: $d = 64$; low dim: $d = 2$). Note that (a) and (b) both use the high-dim setting. The lower the dimensionality, the stronger the Swing-by effect.
  • Figure 3: The test loss of multi-layer models.
  • Figure 4: An illustration of the entries of the Jacobian.
  • Figure 5: The learning dynamics of a symmetric 2-layer linear model. Left: the evolution of the test loss and the Jacobian entries over time, as predicted by the theory. Right: the corresponding model output trajectory. The figures are plotted with $s = 2$, and all entries of ${\boldsymbol{W}}$ are initialized to positive values.
  • ...and 14 more figures

Theorems & Definitions (19)

  • Theorem 4.1
  • Definition D.1
  • Corollary D.1
  • Lemma D.1: Upper Bounded Growth
  • proof
  • Lemma D.2: Lower Bounded Initial Growth
  • proof
  • Lemma D.3: Lower Bounded Initial Growth for Diagonal Entries
  • Lemma D.4: Lower Bounded After-Initial Growth for Diagonal Entries
  • proof
  • ...and 9 more