Provable Compositional Generalization for Object-Centric Learning

Thaddäus Wiedemer, Jack Brady, Alexander Panfilov, Attila Juhos, Matthias Bethge, Wieland Brendel

TL;DR

This work investigates, through the lens of identifiability theory, when compositional generalization is guaranteed for object-centric representations. It shows that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency learn object-centric representations that provably generalize compositionally.

Abstract

Learning representations that generalize to novel compositions of known concepts is crucial for bridging the gap between human and machine perception. One prominent effort is learning object-centric representations, which are widely conjectured to enable compositional generalization. Yet, it remains unclear when this conjecture will be true, as a principled theoretical or empirical understanding of compositional generalization is lacking. In this work, we investigate when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory. We show that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-centric representations that provably generalize compositionally. We validate our theoretical result and highlight the practical relevance of our assumptions through experiments on synthetic image data.

Paper Structure (55 sections, 11 theorems, 53 equations, 9 figures, 2 tables)


Key Result

Theorem 1

Let ${{\boldsymbol f}}: {\mathcal{Z}} \rightarrow {\mathcal{X}}$ be a compositional and irreducible diffeomorphism. Let ${{\mathcal{Z}}^{{S}}}$ be a convex, slot-supported subset of ${\mathcal{Z}}$. An autoencoder $({{\hat{\boldsymbol g}}},{{\hat{\boldsymbol f}}})$ that minimizes …

Figures (9)

  • Figure 1: Compositional generalization in object-centric learning. We assume a latent variable model where objects in an image (here, a triangle and a circle) are described by latent slots. Our notion of compositional generalization requires a model to identify the ground-truth latent slots (slot identifiability, Def. 2) on the train distribution and to transfer this identifiability to out-of-distribution (OOD) combinations of slots (Def. 3). An autoencoder achieves slot identifiability on the train distribution if its decoder is compositional (Thm. 1). Further, we prove that additive decoders are able to generalize OOD, as visualized in (A) via the isolated decoder reconstruction error over a 2D projection of the latent space (see the appendix). However, this does not guarantee that the entire model generalizes OOD, as the encoder will generally not invert the decoder on OOD slot combinations, leading to a large overall reconstruction error (B). To address this, we introduce a compositional consistency regularizer (Def. 6), which allows the full autoencoder to generalize OOD (C, Thm. 3).
  • Figure 2: Overview of our theoretical contribution. (1) We assume access to data from a training space ${{{\mathcal{X}}^{{S}}}} \subseteq {\mathcal{X}}$, which is generated from a slot-supported subset ${{\mathcal{Z}}^{{S}}}$ of the latent space ${\mathcal{Z}}$ (Def. 1), via a compositional and irreducible generator ${{\boldsymbol f}}$. (2) We show that an autoencoder with a compositional decoder ${{\hat{\boldsymbol f}}}$ trained via the reconstruction objective ${\mathcal{L}}_\text{rec}$ on this data will slot-identify ground-truth latents ${{\boldsymbol z}}$ on ${{\mathcal{Z}}^{{S}}}$ (Thm. 1). Since the inferred latents ${\hat{\boldsymbol z}}$ slot-identify ${{\boldsymbol z}}$ ID on ${{\mathcal{Z}}^{{S}}}$, their slot-wise recombinations ${{\mathcal{Z}}{'}}$ slot-identify ${{\boldsymbol z}}$ OOD on ${\mathcal{Z}}$. However, the encoder ${{\hat{\boldsymbol g}}}$ is not guaranteed to infer OOD latents such that ${{\hat{\boldsymbol g}}}({\mathcal{X}}) = {\hat{\mathcal{Z}}} = {{\mathcal{Z}}{'}}$. (3) On the other hand, if the decoder ${{\hat{\boldsymbol f}}}$ is additive, its reconstructions are guaranteed to generalize such that ${{\hat{\boldsymbol f}}}({{\mathcal{Z}}{'}}) = {\mathcal{X}}$ (Thm. 2). (4) Therefore, regularizing the encoder ${{\hat{\boldsymbol g}}}$ to invert ${{\hat{\boldsymbol f}}}$ using our proposed compositional consistency objective ${\mathcal{L}}_\text{cons}$ (Def. 6) enforces ${\hat{\mathcal{Z}}} = {{\mathcal{Z}}{'}}$, thus enabling the model to generalize compositionally (Thm. 3).
  • Figure 3: Compositional consistency regularization. In addition to the reconstruction objective, ${\mathcal{L}}_\text{cons}$ is minimized on recombined latents ${{{\boldsymbol z}}{'}}$. Recombining slots of the inferred latents ${\hat{\boldsymbol z}}$ of two ID samples produces a latent ${{{\boldsymbol z}}{'}}$, which can be rendered to an OOD sample ${{\boldsymbol x}{'}}$ due to the decoder ${{\hat{\boldsymbol f}}}$ generalizing OOD. The encoder ${{\hat{\boldsymbol g}}}$ is optimized to re-encode this sample to match ${{{\boldsymbol z}}{'}}$.
  • Figure 4: Experimental validation of Thm. 3. Left: Slot identifiability is measured throughout training as a function of reconstruction loss ($\mathcal{L}_\text{rec}$) and compositional consistency ($\mathcal{L}_\text{cons}$, Def. 6). As predicted by Thm. 3, models that minimize $\mathcal{L}_\text{rec}$ and $\mathcal{L}_\text{cons}$ learn representations that are slot identifiable OOD. Right: Compositional contrast (see the appendix) decreases throughout training, indicating that the decoder is implicitly optimized to be compositional (Def. 4).
  • Figure 5: Compositional generalization for Slot Attention. Visualizing the decoder reconstruction error over a 2D projection of the latent space (see the appendix for details) reveals that the non-additive masked decoder in Slot Attention does not generalize OOD on our dataset (A). Replacing softmax mask normalization with slot-wise sigmoid functions makes the decoder additive and enables OOD generalization (B, Thm. 2). The full model does not generalize compositionally, however, since the encoder fails to invert the decoder OOD (C). Regularizing with the compositional consistency loss addresses this, enabling generalization (D, Thm. 3).
  • ...and 4 more figures
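
The compositional consistency step described in Figure 3 can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the toy linear `decode` and `encode` functions, the sizes `K`, `D`, `X`, the helper `recombine_slots`, and the squared-error form of the loss. The sketch also draws each slot from a random sample in the batch, whereas the caption describes recombining the slots of two ID samples.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, X = 2, 3, 16  # slots, slot dimension, observation dimension (toy sizes)

# Hypothetical toy additive decoder: one linear map per slot, summed over slots.
W_dec = rng.normal(size=(K, D, X))

def decode(z):
    """z: (batch, K, D) -> x: (batch, X), a sum of per-slot renderings."""
    return np.einsum("bkd,kdx->bx", z, W_dec)

# Hypothetical toy encoder: a single linear map back to slot latents.
W_enc = rng.normal(size=(X, K * D)) * 0.1

def encode(x, W=None):
    """x: (batch, X) -> z: (batch, K, D)."""
    W = W_enc if W is None else W
    return (x @ W).reshape(-1, K, D)

def recombine_slots(z_hat, rng):
    """Build z' by taking each slot from a randomly chosen sample in the batch."""
    batch = z_hat.shape[0]
    idx = rng.integers(0, batch, size=(batch, K))   # source sample per slot
    return z_hat[idx, np.arange(K)[None, :], :]     # (batch, K, D)

def consistency_loss(z_hat, rng, W=None):
    """Mean squared error between recombined latents and their re-encoding."""
    z_prime = recombine_slots(z_hat, rng)   # recombined, possibly OOD, latents
    x_prime = decode(z_prime)               # rendered by the decoder
    z_re = encode(x_prime, W)               # re-encoded by the encoder
    return np.mean(np.sum((z_re - z_prime) ** 2, axis=(1, 2)))

z_hat = rng.normal(size=(8, K, D))          # stand-in for inferred ID latents
print("untrained encoder:", consistency_loss(z_hat, rng))

# An encoder that inverts the decoder drives the loss to (numerically) zero.
W_inv = np.linalg.pinv(W_dec.reshape(K * D, X))
print("inverting encoder:", consistency_loss(z_hat, rng, W_inv))
```

In an actual training loop this term would be added to the reconstruction objective and minimized with respect to the encoder; the sketch only evaluates it.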

Theorems & Definitions (44)

  • Definition 1: Slot-supported subset
  • Definition 2: Slot identifiability
  • Definition 3: Compositional generalization
  • Definition 4: Compositionality
  • Theorem 1: Slot identifiability on slot-supported subset
  • Definition 5: Additive decoder
  • Theorem 2: Decoder generalization
  • Definition 6: Compositional consistency
  • Theorem 3: Compositionally generalizing autoencoder
  • Definition 7: ${C^k}$-Diffeomorphism
  • ...and 34 more
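
To make the additive decoder (Definition 5) more concrete, the sketch below contrasts the two mask normalizations mentioned in the Figure 5 caption on a toy two-slot renderer. It is a hedged illustration only: the linear per-slot maps `W_rgb` and `W_mask`, the function `render`, and the `mixed_difference` check are hypothetical names and not part of Slot Attention or the paper's code. The check relies on the fact that for any function of the form $\varphi_1(z_1) + \varphi_2(z_2)$, the mixed difference $f(a_1,a_2) - f(a_1,b_2) - f(b_1,a_2) + f(b_1,b_2)$ vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
D, X = 4, 32  # slot dimension and number of pixels (toy sizes)

# Hypothetical per-slot decoder shared across slots: maps one slot to
# per-pixel content and a per-pixel mask logit.
W_rgb = rng.normal(size=(D, X))
W_mask = rng.normal(size=(D, X))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def render(z, mask="sigmoid"):
    """z: (K, D) -> image: (X,), a mask-weighted sum of per-slot content."""
    content = z @ W_rgb   # (K, X)
    logits = z @ W_mask   # (K, X)
    if mask == "softmax":
        # Normalized across slots: every mask depends on all slots.
        e = np.exp(logits - logits.max(axis=0, keepdims=True))
        m = e / e.sum(axis=0, keepdims=True)
    else:
        # Slot-wise sigmoid: each mask depends on its own slot only.
        m = sigmoid(logits)
    return (m * content).sum(axis=0)

def mixed_difference(z_a, z_b, mask):
    """Largest absolute mixed difference; zero if the renderer is additive."""
    a1, a2 = z_a
    b1, b2 = z_b
    f = lambda s1, s2: render(np.stack([s1, s2]), mask)
    return np.abs(f(a1, a2) - f(a1, b2) - f(b1, a2) + f(b1, b2)).max()

z_a, z_b = rng.normal(size=(2, 2, D))  # two latents with two slots each
print("sigmoid masks:", mixed_difference(z_a, z_b, "sigmoid"))
print("softmax masks:", mixed_difference(z_a, z_b, "softmax"))
```

With slot-wise sigmoid masks the mixed difference should sit at floating-point noise, while softmax normalization couples the slots and generally leaves a clearly nonzero value.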