Local Mechanisms of Compositional Generalization in Conditional Diffusion

Arwen Bradley

Abstract

Conditional diffusion models appear capable of compositional generalization, i.e., generating convincing samples for out-of-distribution combinations of conditioners, but the mechanisms underlying this ability remain unclear. To make this concrete, we study length generalization, the ability to generate images with more objects than seen during training. In a controlled CLEVR setting (Johnson et al., 2017), we find that length generalization is achievable in some cases but not others, suggesting that models only sometimes learn the underlying compositional structure. We then investigate locality as a structural mechanism for compositional generalization. Prior works proposed score locality as a mechanism for creativity in unconditional diffusion models (Kamb & Ganguli, 2024; Niedoba et al., 2024), but did not address flexible conditioning or compositional generalization. In this paper, we prove an exact equivalence between a specific compositional structure ("conditional projective composition") (Bradley et al., 2025) and scores with sparse dependencies on both pixels and conditioners ("local conditional scores"). This theory also extends to feature-space compositionality. We validate our theory empirically: CLEVR models that succeed at length generalization exhibit local conditional scores, while those that fail do not. Furthermore, we show that a causal intervention explicitly enforcing local conditional scores restores length generalization in a previously failing model. Finally, we investigate SDXL and find that in pixel-space, spatial locality is present but conditional-locality is mostly absent; however, we find quantitative evidence of local conditional scores in the network's learned feature-space.

Paper Structure

This paper contains 42 sections, 6 theorems, 42 equations, 17 figures, 4 tables.

Key Result

Lemma 1

A distribution $p(x|c_{\mathcal{J}})$ is a pixel-space Conditional Projective Composition (Definition 2) with disjoint sets $\{M_j\}_{j \in \mathcal{J}_{\text{all}}}$ if and only if its score $s(x|c_{\mathcal{J}})$ is a Local Conditional Score (Definition 1) with subsets
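
One direction of this equivalence can be checked numerically in a toy case. The sketch below is an illustration, not the paper's experiment: it assumes a 4-"pixel" Gaussian that factorizes over two disjoint blocks, each with its own scalar conditioner, and uses finite differences to confirm that the score at each pixel depends only on pixels in its own block and on that block's conditioner. The names `BLOCKS`, `PRECISIONS`, `log_density`, `grad`, `score`, and `jacobian` are hypothetical, and the specific subsets named in the lemma are not reproduced here.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): 4 "pixels" split into two
# disjoint blocks M_0 and M_1, each governed by its own conditioner c_j.
BLOCKS = {0: [0, 1], 1: [2, 3]}
PRECISIONS = {
    0: np.array([[2.0, 0.5], [0.5, 1.0]]),
    1: np.array([[1.5, -0.3], [-0.3, 1.0]]),
}

def log_density(x, c):
    """Unnormalized log p(x | c) = sum_j log p_j(x_{M_j} | c_j).

    Block j is Gaussian with mean (c_j, -c_j) and precision PRECISIONS[j].
    """
    total = 0.0
    for j, idx in BLOCKS.items():
        mean = np.array([c[j], -c[j]])
        r = x[idx] - mean
        total += -0.5 * r @ PRECISIONS[j] @ r
    return total

def grad(f, z, eps=1e-4):
    """Central finite-difference gradient of a scalar function f at z."""
    g = np.zeros_like(z)
    for k in range(z.size):
        dz = np.zeros_like(z)
        dz[k] = eps
        g[k] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return g

def score(x, c):
    """Score s(x | c) = gradient of log p(x | c) with respect to x."""
    return grad(lambda z: log_density(z, c), x)

def jacobian(f, z, eps=1e-4):
    """Finite-difference Jacobian J[i, k] = d f_i / d z_k."""
    cols = []
    for k in range(z.size):
        dz = np.zeros_like(z)
        dz[k] = eps
        cols.append((f(z + dz) - f(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([0.3, -1.2, 0.7, 0.1])
c = np.array([0.5, -0.8])

J_x = jacobian(lambda z: score(z, c), x)  # shape (4, 4): score vs. pixels
J_c = jacobian(lambda z: score(x, z), c)  # shape (4, 2): score vs. conditioners

# Off-block entries vanish: the score at a pixel ignores the other block's
# pixels and the other block's conditioner.
print(np.round(J_x, 4))
print(np.round(J_c, 4))
```

In this toy, `J_x` is block-diagonal and each row of `J_c` has a single nonzero entry, which is the sparse dependency pattern that the local-conditional-score definition describes. The converse direction (a local score implies a factorized distribution) is the part that needs the lemma's proof and is not demonstrated by this check.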

Figures (17)

  • Figure 1: Length generalization in location-conditioned CLEVR models. We study length generalization in location-conditioned models trained on images with 1-3 objects, and tested on 1, 3, 6, 9 locations (6, 9 are OOD), with red dots indicating the conditioned locations at test-time. For each experiment, the rows correspond to different conditioners (1, 3, 6, or 9 locations) and the columns show 4 different samples. All models have the same architectures and training data and differ only in the design of their conditioners (see \ref{fig:loc_cond}). In Experiment 1, a grid-style conditioner labels the locations of all objects in the scene; the model successfully length-generalizes up to 7 locations. In Experiment 2, a grid-style conditioner labels the location of only a single object (randomly selected); the model fails to length-generalize (in this case, even 3 locations is OOD). In Experiment 3, a list-style conditioner labels the locations of all objects; this model fails to length-generalize beyond 3 objects. Additional samples shown in \ref{fig:exp123_full}. Experiment 2L applies a causal intervention to the failing Exp. 2: we modify the model architecture to explicitly enforce local conditional scores, use the same training data and conditioning as Exp. 2, and find that Exp. 2L length-generalizes while Exp. 2 failed (see also \ref{app:exp3L} for an analogous Exp. 3L).
  • Figure 2: (Left) Locality in location-conditioned CLEVR models. For Experiments 1, 2 and 3 of \ref{fig:clevr_locat_len_gen} each conditioned on four locations, we visualize pixel-locality via heatmaps, and conditional locality via the intensity of the $\times$ marker, centered at a pixel in the lower left, over a range of timesteps. (\ref{app:grad_details} describes the locality measurements; \ref{fig:exp_123_metrics} plots locality metrics; \ref{fig:clevr_loc_full} shows more pixel locations.) The length-generalizing Exp. 1 model exhibits strong pixel- and conditional-locality, while the non-length-generalizing Exp. 2 and 3 models both lack conditional-locality (the scores depend on non-local conditioners); Exp. 3 also lacks pixel-locality. These experiments support the theoretical equivalence between CPC and LCS. (Right) Length generalization vs. conditioner locality for several models (different colors), each checkpointed early, mid, and late in training (different shapes). Details are in \ref{app:scatter_colors}. Length generalization and conditional locality are strongly correlated, and can emerge together over the course of training (e.g. orange, green, red models). Here, length-generalization ($x$-axis) is the number of locations to which the model can generalize ($K_{\text{max}}$ in \ref{table:xy-learned-counts}) minus the maximum number on which it was trained (e.g. +4 for a model trained on 1-3 locations that generalizes to 7). The conditional locality ($y$-axis) metric is described in \ref{app:grad_details}.
  • Figure 3: (Left) Conditional projective composition (CPC) and local conditional scores (LCS). A CPC is a conditional distribution over a set of conditions $c_\mathcal{J}$ that factorizes independently into the marginals over $x_{M_j}$ conditioned on $c_j$, where $M_j$ are disjoint subsets. An LCS is a conditional score over a set of conditions $c_\mathcal{J}$ such that the score at each pixel $i$ depends only on a subset $N_i$ of other pixels (often a local neighborhood) as well as a subset $L_i \subset \mathcal{J}$ of conditions (for location-conditioning, often nearby conditioners). For certain choices of subsets, CPC and LCS are equivalent.
  • Figure 4: SDXL pixel-space locality. (Left) An SDXL-generated image for the prompt "a beautiful photograph with a horse in the middle, a dog on the left, and a cat on the right," with four analysis locations marked. (Center) Pixel gradient magnitude heatmaps at low, mid, and high noise for each location, showing pixel-locality (localized cross patterns) especially at low noise. (Right) Per-word conditional influence (ablation) over time at two pixel locations: the curves are nearly identical despite the pixels being in different spatial regions, indicating a lack of conditional-locality in pixel-space. Averaged over 10 seeds; additional seeds and prompts in \ref{fig:sdxl_pix_extra_grad}, spatial word-influence heatmaps in \ref{fig:sdxl_pix_extra}.
  • Figure 5: Feature-space disentanglement in SDXL. (Top left) Within- vs. between-category mean cosine similarity (F-LCS heuristic, \ref{lem:f-lcs_heuristic}) for hidden-state activations in the down, mid, and up blocks (at high noise) and for output spaces (latent/VAE, pixel). Feature-space representations are more disentangled (larger within/between gap) than output spaces. (Top right) Per-layer disentanglement ratio (within/between cosine similarity) across all transformer layers, showing an arch-shaped profile peaking near the U-net bottleneck. (Middle) $8\times 8$ cosine similarity heatmaps for 8 representative concepts across feature-space (down, mid, up blocks) and output spaces (latent, pixel). Block-diagonal category structure (animals, art styles, foods) is most pronounced in feature-space. (Bottom) Example SDXL compositions: concepts that are disentangled in feature-space compose more successfully. Methodology: 24 prompts across 4 categories, 10 seeds, hidden states (post-FFN activations from BasicTransformerBlock) at high noise; see \ref{app:sdxl_detail} and \ref{fig:sdxl_full_feat_space}.
  • ...and 12 more figures
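
The within- vs. between-category cosine similarity comparison named in the Figure 5 caption can be sketched as follows. This is a minimal illustration under assumptions: the function name `disentanglement_ratio`, the tiny hand-written feature vectors, and the category labels are all hypothetical, and the paper's actual procedure (averaging over prompts, seeds, and layers) may differ in its details.

```python
import numpy as np

def disentanglement_ratio(features, categories):
    """Return (within_mean, between_mean, ratio) of pairwise cosine similarities.

    features:   array-like of shape (n_concepts, dim), one vector per concept
    categories: length-n sequence of category labels
    The ratio is only meaningful when between_mean is positive and not near
    zero; this sketch does not guard against that case.
    """
    feats = np.asarray(features, dtype=float)
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = unit @ unit.T                       # pairwise cosine similarities
    cats = np.asarray(categories)
    same = cats[:, None] == cats[None, :]     # True where categories match
    off_diag = ~np.eye(len(cats), dtype=bool)  # exclude self-similarity
    within = sim[same & off_diag].mean()
    between = sim[~same].mean()
    return within, between, within / between

# Hypothetical stand-ins for per-concept activations (not real SDXL features).
features = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
]
categories = ["animal", "animal", "food", "food"]

within, between, ratio = disentanglement_ratio(features, categories)
print(within, between, ratio)
```

A larger gap between `within` and `between` (equivalently, a larger `ratio`) corresponds to the block-diagonal category structure that the caption describes for the feature-space heatmaps.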

Theorems & Definitions (18)

  • Definition 1: Local Conditional Score (LCS)
  • Definition 2: (Pixel-space) Conditional Projective Composition (CPC)
  • Lemma 1: Equivalence of CPC and LCS
  • Corollary 1: F-LCS is exact for F-CPC; informal
  • Lemma 2: F-LCS necessary-but-not-sufficient heuristic
  • Remark 1
  • Remark 2
  • proof
  • Definition 3: Approximate CPC
  • Lemma 3: LCS approximates score of approximate-CPC
  • ...and 8 more