Compositional Generalization via Forced Rendering of Disentangled Latents

Qiyao Liang, Daoyuan Qian, Liu Ziyin, Ila Fiete

TL;DR

The paper investigates why disentangled latent representations often fail to support robust compositional generalization, using a controlled 2D Gaussian bump task to reveal layerwise re-entanglement and memorization as key failure modes. By applying kernel and transport analyses, the authors show that factorized latents do not guarantee extrapolation when downstream layers warp the representation and memorize training data. They propose two practical remedies: architectural rendering of disentangled factors into the output space with low-rank embedding regularization, and data curricula that isolate factors as pixel-space building blocks (e.g., stripes). Both yield data-efficient, strong OOD compositional generalization. The findings emphasize that factorization must be preserved in the output representation, guiding design principles for more reliable, compositional neural models applicable to vision and beyond.

Abstract

Composition, the ability to generate myriad variations from finite means, is believed to underlie powerful generalization. However, compositional generalization remains a key challenge for deep learning. A widely held assumption is that learning disentangled (factorized) representations naturally supports this kind of extrapolation. Yet empirical results are mixed, with many generative models failing to recognize and compose factors to generate out-of-distribution (OOD) samples. In this work, we investigate a controlled 2D Gaussian "bump" generation task with fully disentangled (x,y) inputs, demonstrating that standard generative architectures still fail in OOD regions when trained on partial data, because they re-entangle latent representations in subsequent layers. By examining the model's learned kernels and manifold geometry, we show that this failure reflects a "memorization" strategy for generation via data superposition rather than via composition of the true factorized features. We show that when models are forced, through architectural modifications with regularization or curated training data, to render the disentangled latents into the full-dimensional representational (pixel) space, they can be highly data-efficient and effective at composing in OOD regions. These findings underscore that disentangled latents in an abstract representation are insufficient and show that if models can represent disentangled factors directly in the output representational space, they can achieve robust compositional generalization.
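The task setup described in the abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the image size `n=28`, the width `sigma=1.0`, the bounds of the held-out square (`lo=10`, `hi=18`), the peak-equals-1 intensity convention, and all function names are assumptions chosen for the example. It uses only the Python standard library.

```python
import math

def gaussian_bump(x0, y0, n=28, sigma=1.0):
    """Return an n x n image (list of rows) with a 2D Gaussian bump centered at (x0, y0).

    Intensity polarity (peak = 1.0 on a zero background) is an assumption.
    """
    return [
        [math.exp(-((col - x0) ** 2 + (row - y0) ** 2) / (2.0 * sigma ** 2))
         for col in range(n)]
        for row in range(n)
    ]

def in_ood_region(x0, y0, lo=10, hi=18):
    """True when the bump center lies in the held-out central square (hypothetical bounds)."""
    return lo <= x0 <= hi and lo <= y0 <= hi

def make_split(n=28, lo=10, hi=18):
    """Split all integer centers into in-distribution (train) and OOD (held-out) sets."""
    centers = [(x, y) for x in range(n) for y in range(n)]
    train = [c for c in centers if not in_ood_region(c[0], c[1], lo, hi)]
    ood = [c for c in centers if in_ood_region(c[0], c[1], lo, hi)]
    return train, ood

train_centers, ood_centers = make_split()
train_images = [gaussian_bump(x, y) for x, y in train_centers]
```

Under this split, every individual value of x and every individual value of y still appears in the training set; only certain (x, y) combinations are withheld. That is what makes the held-out square a test of composition rather than of extrapolation to unseen factor values.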

Paper Structure

This paper contains 63 sections, 44 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Various disentangled input/latent encodings fail to support compositional generalization. (a) shows the experimental setup for the 2D Gaussian bump toy experiment, where the scalar $(x,y)$ coordinate pair is encoded into a disentangled representation (population-based vs. ramp-based encodings), which is then fed to a decoder-only architecture to generate an $N\times N$ grayscale image of a single 2D Gaussian bump centered at the corresponding $(x,y)$ location. The training dataset excludes all images that contain 2D Gaussian bumps centered within the red-shaded OOD region in the center of the image field. (b-c) show the MSE contour plots and sample generated ID/OOD images for bump-based and ramp-based encodings of the $x$ and $y$ inputs; the compositional OOD region is marked by the red-dashed bounding box and the ground-truth bump location by a red cross. For a non-compositional network trained with bump-encoded inputs, (d) demonstrates that it learns to "superpose" seen ID training data when asked to compositionally generalize, and (e) shows the agreement between a theoretical binary factorized kernel and the similarity matrix (computed from pixel overlap) between the model's ID and OOD generated samples.
  • Figure 2: Cause of failure of compositional generalization despite disentangled inputs/latents: input memorization by the decoder undoes factorization. (a-b) show the linear probe metrics of the learned representation as a function of $x$ and $y$ for bump- and ramp-input encodings, (c-d) show the factorization score (Eq. \ref{eq:factorization}), and (e-f) show the volume metric $\log_{10}(dv)$, all as a function of layer depth, for models trained with bump-based vs. ramp-based input encodings. The linear probe metric is defined as the $R^2$ scores of fitting two linear classifiers with respect to $x$ and $y$.
  • Figure 3: Inducing compositional generalization through architectural rendering and regularization constraints. (a) Schematic of architecturally forced rendering of the initially disentangled representations (with 1-hot input encodings) into a space matching the output space (disentangled processing). (b) Sampled embedding activations corresponding to $x=14$ and $y=14$ for networks trained without and with regularization, respectively. (c) Generated images when the OOD region is in the top right corner, for the non-regularized (top) and regularized (bottom) networks. (d) Volume metric as a function of layer depth for networks trained without and with regularization. (e) OOD vs. ID MSE plot for various ablation studies over many runs.
  • Figure 4: Inducing generalizable composition by training single disentangled factors to render (data curriculum). (a) Sample grayscale images of 1D Gaussian "stripes" at $x=14$ and $y=14$. (b) Generated OOD output as a function of the number of Gaussian bumps (all outside the upper right OOD region) included in the training dataset. (c) Data/sample efficiency: data scaling of the stripes-only ($\sim N$) vs. bumps-only ($\sim N^3$) datasets to reach $90\%$ accuracy of $x$ and $y$ generation as a function of image size $N$ (accuracy assessed based on the location of the darkest pixel). Here the stripes + bumps dataset consists of $\sim 7N$ bumps + $2N$ stripes; learning breaks the curse of dimensionality due to the ability to compositionally generalize zero-shot. (d) Volume metric as a function of layer depth of the network trained on a dataset of stripes + bumps. (e) 2D neuron activations across different channels in the first layer (layer 0) of a network trained on a dataset of stripes + 50 bumps. (f) Neural tuning curves of two sample neurons at layer 0 and layer 3 of the same network as in (e).
  • Figure 5: The action of the neural net can be imagined as wrapping of the (originally flat) native space (left) into the embedding space (right).
  • ...and 10 more figures
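
The stripe curriculum in Figure 4 rests on a simple property of the target images: an isotropic 2D Gaussian bump is exactly the pixelwise product of a 1D Gaussian stripe in $x$ and one in $y$. The sketch below checks that identity numerically and builds a stripes-only dataset of $2N$ images. It is an illustration under assumed settings (`n=28`, `sigma=1.0`, peak intensity 1.0, hypothetical function names), and it describes the data only, not how the trained network combines stripes internally.

```python
import math

def stripe_profile(center, n=28, sigma=1.0):
    """1D Gaussian profile of length n centered at `center`."""
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(n)]

def x_stripe(x0, n=28, sigma=1.0):
    """Vertical stripe: intensity depends only on the column index."""
    profile = stripe_profile(x0, n, sigma)
    return [list(profile) for _ in range(n)]

def y_stripe(y0, n=28, sigma=1.0):
    """Horizontal stripe: intensity depends only on the row index."""
    profile = stripe_profile(y0, n, sigma)
    return [[profile[row]] * n for row in range(n)]

def multiply(a, b):
    """Pixelwise product of two equally sized images."""
    return [[pa * pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

def bump(x0, y0, n=28, sigma=1.0):
    """2D Gaussian bump rendered directly from its center coordinates."""
    return [
        [math.exp(-((col - x0) ** 2 + (row - y0) ** 2) / (2.0 * sigma ** 2))
         for col in range(n)]
        for row in range(n)
    ]

n = 28
# Stripes-only dataset: one x-stripe and one y-stripe per position, 2N images in total.
stripes = [x_stripe(i, n) for i in range(n)] + [y_stripe(j, n) for j in range(n)]

# Compose the two stripes from Figure 4(a) and compare with the directly rendered bump.
composed = multiply(x_stripe(14, n), y_stripe(14, n))
direct = bump(14, 14, n)
max_abs_diff = max(
    abs(c - d) for rc, rd in zip(composed, direct) for c, d in zip(rc, rd)
)
```

Because each factor appears on its own in pixel space, the number of stripe images grows linearly with $N$, whereas covering every $(x, y)$ combination with bumps grows at least quadratically.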