Provable Compositional Generalization for Object-Centric Learning

Thaddäus Wiedemer; Jack Brady; Alexander Panfilov; Attila Juhos; Matthias Bethge; Wieland Brendel

TL;DR

This work investigates when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory and shows that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-focused representations that provably generalize compositionally.

Abstract

Learning representations that generalize to novel compositions of known concepts is crucial for bridging the gap between human and machine perception. One prominent effort is learning object-centric representations, which are widely conjectured to enable compositional generalization. Yet, it remains unclear when this conjecture will be true, as a principled theoretical or empirical understanding of compositional generalization is lacking. In this work, we investigate when compositional generalization is guaranteed for object-centric representations through the lens of identifiability theory. We show that autoencoders that satisfy structural assumptions on the decoder and enforce encoder-decoder consistency will learn object-centric representations that provably generalize compositionally. We validate our theoretical result and highlight the practical relevance of our assumptions through experiments on synthetic image data.

Paper Structure (55 sections, 11 theorems, 53 equations, 9 figures, 2 tables)

Introduction
Notation
Problem Setup
Challenge 1: Identifiability
Challenge 2: Generalization
Compositional Generalization in Theory
Decoder Generalization via Additivity
Encoder Generalization via Compositional Consistency
Putting it All Together
Compositional Generalization in Practice
Compositionality
Additivity
Compositional Consistency
Related work
Theoretical Analyses of Compositional Generalization
...and 40 more sections

Key Result

Theorem 1

Let $\boldsymbol{f}: \mathcal{Z} \rightarrow \mathcal{X}$ be a compositional and irreducible diffeomorphism. Let $\mathcal{Z}^{S}$ be a convex, slot-supported subset of $\mathcal{Z}$. An autoencoder $(\hat{\boldsymbol{g}}, \hat{\boldsymbol{f}})$ that minimizes …

Figures (9)

Figure 1: Compositional generalization in object-centric learning. We assume a latent variable model where objects in an image (here, a triangle and a circle) are described by latent slots. Our notion of compositional generalization requires a model to identify the ground-truth latent slots (slot identifiability, Def. 2) on the train distribution and to transfer this identifiability to out-of-distribution (OOD) combinations of slots (Def. 3). An autoencoder achieves slot identifiability on the train distribution if its decoder is compositional (Thm. 1). Further, we prove that additive decoders generalize OOD, as visualized in (A) via the isolated decoder reconstruction error over a 2D projection of the latent space (see App. \ref{app:rec_error}). However, this does not guarantee that the entire model generalizes OOD, as the encoder will generally not invert the decoder on OOD slot combinations, leading to a large overall reconstruction error (B). To address this, we introduce a compositional consistency regularizer (Def. 6), which allows the full autoencoder to generalize OOD (C, Thm. 3).
Figure 2: Overview of our theoretical contribution. (1) We assume access to data from a training space $\mathcal{X}^{S} \subseteq \mathcal{X}$, which is generated from a slot-supported subset $\mathcal{Z}^{S}$ of the latent space $\mathcal{Z}$ (Def. 1), via a compositional and irreducible generator $\boldsymbol{f}$. (2) We show that an autoencoder with a compositional decoder $\hat{\boldsymbol{f}}$ trained via the reconstruction objective $\mathcal{L}_\text{rec}$ on this data will slot-identify ground-truth latents $\boldsymbol{z}$ on $\mathcal{Z}^{S}$ (Thm. 1). Since the inferred latents $\hat{\boldsymbol{z}}$ slot-identify $\boldsymbol{z}$ in-distribution (ID) on $\mathcal{Z}^{S}$, their slot-wise recombinations $\mathcal{Z}'$ slot-identify $\boldsymbol{z}$ OOD on $\mathcal{Z}$. However, the encoder $\hat{\boldsymbol{g}}$ is not guaranteed to infer OOD latents such that $\hat{\boldsymbol{g}}(\mathcal{X}) = \hat{\mathcal{Z}} = \mathcal{Z}'$. (3) On the other hand, if the decoder $\hat{\boldsymbol{f}}$ is additive, its reconstructions are guaranteed to generalize such that $\hat{\boldsymbol{f}}(\mathcal{Z}') = \mathcal{X}$ (Thm. 2). (4) Therefore, regularizing the encoder $\hat{\boldsymbol{g}}$ to invert $\hat{\boldsymbol{f}}$ using our proposed compositional consistency objective $\mathcal{L}_\text{cons}$ (Def. 6) enforces $\hat{\mathcal{Z}} = \mathcal{Z}'$, thus enabling the model to generalize compositionally (Thm. 3).
Figure 3: Compositional consistency regularization. In addition to the reconstruction objective, $\mathcal{L}_\text{cons}$ is minimized on recombined latents $\boldsymbol{z}'$. Recombining slots of the inferred latents $\hat{\boldsymbol{z}}$ of two ID samples produces a latent $\boldsymbol{z}'$, which can be rendered to an OOD sample $\boldsymbol{x}'$ because the decoder $\hat{\boldsymbol{f}}$ generalizes OOD. The encoder $\hat{\boldsymbol{g}}$ is optimized to re-encode this sample to match $\boldsymbol{z}'$.
Figure 4: Experimental validation of Thm. 3. Left: Slot identifiability is measured throughout training as a function of reconstruction loss ($\mathcal{L}_\text{rec}$, Eq. \ref{eq:lrec}) and compositional consistency ($\mathcal{L}_\text{cons}$, Def. 6). As predicted by Thm. 3, models that minimize $\mathcal{L}_\text{rec}$ and $\mathcal{L}_\text{cons}$ learn representations that are slot identifiable OOD. Right: Compositional contrast (see App. \ref{app:comp_contrast}) decreases throughout training, indicating that the decoder is implicitly optimized to be compositional (Def. 4).
Figure 5: Compositional generalization for Slot Attention. Visualizing the decoder reconstruction error over a 2D projection of the latent space (see App. \ref{app:rec_error} for details) reveals that the non-additive masked decoder in Slot Attention does not generalize OOD on our dataset (A). Replacing softmax mask normalization with slot-wise sigmoid functions makes the decoder additive and enables OOD generalization (B, Thm. 2). The full model does not generalize compositionally, however, since the encoder fails to invert the decoder OOD (C). Regularizing with the compositional consistency loss addresses this, enabling generalization (D, Thm. 3).
...and 4 more figures
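
The regularization loop described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `recombine_slots`, `consistency_loss`, and `make_toy_autoencoder` are hypothetical, the toy autoencoder is an invertible linear map (which is additive across slots), and the squared-error form of the loss is assumed from the caption's statement that the encoder is optimized to re-encode the rendered sample to match the recombined latent.

```python
import numpy as np

def recombine_slots(z_a, z_b, rng):
    """Build a recombined latent z' by taking each slot from z_a or z_b.

    z_a, z_b: arrays of shape (K, M), i.e. K slots of dimension M each.
    """
    take_from_a = rng.random(z_a.shape[0]) < 0.5      # one coin flip per slot
    return np.where(take_from_a[:, None], z_a, z_b)   # broadcast over slot dims

def consistency_loss(encode, decode, z_prime):
    """Squared error between z' and its re-encoding encode(decode(z'))."""
    z_cycle = encode(decode(z_prime))
    return float(np.sum((z_cycle - z_prime) ** 2))

def make_toy_autoencoder(K, M, rng):
    """Hypothetical stand-in model: an invertible linear decoder and its inverse.

    A linear map of the flattened slots is a sum of per-slot terms,
    so this toy decoder is additive.
    """
    A = np.eye(K * M) + 0.1 * rng.standard_normal((K * M, K * M))
    A_inv = np.linalg.inv(A)
    decode = lambda z: A @ z.reshape(-1)
    encode = lambda x: (A_inv @ x).reshape(K, M)
    return encode, decode

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    encode, decode = make_toy_autoencoder(K=3, M=2, rng=rng)
    z_a = rng.standard_normal((3, 2))   # inferred latents of one ID sample
    z_b = rng.standard_normal((3, 2))   # inferred latents of another ID sample
    z_prime = recombine_slots(z_a, z_b, rng)
    print(consistency_loss(encode, decode, z_prime))
```

In this toy setting the encoder inverts the decoder everywhere, so the loss is zero up to rounding. In a trained model the encoder is only fitted on ID data, which is why the caption describes minimizing this term on recombined latents in addition to the reconstruction objective.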

Theorems & Definitions (44)

Definition 1: Slot-supported subset
Definition 2: Slot identifiability
Definition 3: Compositional generalization
Definition 4: Compositionality
Theorem 1: Slot identifiability on slot-supported subset
Definition 5: Additive decoder
Theorem 2: Decoder generalization
Definition 6: Compositional consistency
Theorem 3: Compositionally generalizing autoencoder
Definition 7: ${C^k}$-Diffeomorphism
...and 34 more
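
To make the additive decoder of Definition 5 concrete, the sketch below contrasts the two mask normalizations mentioned in the Figure 5 caption. It is an illustration under assumptions rather than the paper's code: the per-slot renderer `slot_render`, its weight matrices, and the `interaction` diagnostic are all hypothetical. A decoder is treated here as additive when its output is a sum of terms that each depend on a single slot. With slot-wise sigmoid masks every term depends on one slot only, whereas softmax normalization across slots couples them.

```python
import numpy as np

def slot_render(z_k, W_rgb, W_mask):
    """Hypothetical per-slot renderer: returns (content, mask logits), each of shape (N,)."""
    return np.tanh(W_rgb @ z_k), W_mask @ z_k

def decode_softmax(z, W_rgb, W_mask):
    """Masks normalized across slots: each slot's weight depends on all slots."""
    contents, logits = zip(*(slot_render(z_k, W_rgb, W_mask) for z_k in z))
    logits = np.stack(logits)                          # shape (K, N)
    masks = np.exp(logits - logits.max(axis=0))        # numerically stable softmax
    masks = masks / masks.sum(axis=0)
    return (masks * np.stack(contents)).sum(axis=0)

def decode_sigmoid(z, W_rgb, W_mask):
    """Slot-wise sigmoid masks: the output is a sum of one term per slot."""
    out = 0.0
    for z_k in z:
        content, logit = slot_render(z_k, W_rgb, W_mask)
        out = out + content / (1.0 + np.exp(-logit))   # sigmoid(logit) * content
    return out

def interaction(decode, a, b, **weights):
    """Mixed difference over two slots; it vanishes for any additive decoder."""
    def f(s0, s1):
        return decode(np.stack([s0, s1]), **weights)
    d = (f(a[0], a[1]) - f(a[0], b[1])) - (f(b[0], a[1]) - f(b[0], b[1]))
    return float(np.abs(d).max())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, N = 3, 8                                        # slot dimension, number of pixels
    weights = dict(W_rgb=rng.standard_normal((N, M)),
                   W_mask=rng.standard_normal((N, M)))
    a, b = rng.standard_normal((2, M)), rng.standard_normal((2, M))
    print("sigmoid:", interaction(decode_sigmoid, a, b, **weights))
    print("softmax:", interaction(decode_softmax, a, b, **weights))
```

The mixed difference measures how much the effect of changing one slot depends on the value of the other slot. It is zero (up to rounding) for the sigmoid variant and generically nonzero for the softmax variant.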
