Successes and Limitations of Object-centric Models at Compositional Generalisation

Milton L. Montero, Jeffrey S. Bowers, Gaurav Malhotra

TL;DR

The paper addresses the challenge of compositional generalisation in vision, arguing that disentangled VAEs fall short and object-centric approaches may offer better generalisation. It evaluates Slot Attention on object-level composition tasks and introduces a Pentomino-based dataset together with a simplified FgSeg variant to isolate local-feature effects. Results show that object-centric representations can achieve recombination-to-range generalisation and, with careful data design, extrapolate to unseen shapes; per-object reconstructions and partial ablations illustrate the mechanisms and limits. The study also highlights that, despite improvements, latent representations may not be fully abstract and depend on data diversity, motivating future work to integrate Gestalt-inspired priors and broader model comparisons. Overall, the findings suggest object-centric models, when trained with strategic data curation, can robustly support compositional generalisation without language priors, guiding future research directions.

Abstract

In recent years, it has been shown empirically that standard disentangled latent variable models do not support robust compositional learning in the visual domain. Indeed, in spite of being designed with the goal of factorising datasets into their constituent factors of variation, disentangled models show extremely limited compositional generalisation capabilities. On the other hand, object-centric architectures have shown promising compositional skills, although 1) these have not been extensively tested and 2) experiments have been limited to scene composition -- where models must generalise to novel combinations of objects in a visual scene instead of novel combinations of object properties. In this work, we show that these compositional generalisation skills extend to this latter setting. Furthermore, we present evidence pointing to the source of these skills and how they can be improved through careful training. Finally, we point to one important limitation that still exists, which suggests new directions of research.

Paper Structure

This paper contains 13 sections, 1 equation, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Slot Attention generalisation results: Reconstructions for a model when trained on all but some combinations of generative factors. The model is tested on said excluded combinations. Left) Generalisation results when excluding half of the combinations of the colors with the pill shape in 3DShapes. Right) Analogous test on dSprites where we exclude half of the rotations of the heart.
  • Figure 2: Pentomino shapes. a) The twelve Pentomino shapes and their names. We construct the dataset by performing affine transformations of these shapes: 5 values of scale, 40 values of rotation and 20 values of translation along each of the X and Y axes. b) The low-level features that comprise the different shapes. From top to bottom: straight lines, convex right angles and concave right angles.
  • Figure 3: Generalization to novel shape and rotation combinations in the Pentomino dataset. Generalization reconstructions for both FgSeg and a WAE control model. The models were trained on 11 of the 12 Pentomino shapes and tested at reconstructing a held-out one in different configurations of position, rotation and scale.
  • Figure 4: New shape extrapolation. Left) Slot Attention (SA) reconstructions of a novel shape, in this case the W. From left to right, the panels show different values of rotation sampled uniformly over the whole range $[0, 360)$. Right) The same results for WAE. SA succeeds where WAE does not.
  • Figure 5: Pentomino shapes. a) The twelve Pentomino shapes and their names. We construct the dataset by performing affine transformations of these shapes: 5 values of scale, 40 values of rotation and 20 values of translation along each of the X and Y axes. b) The low-level features that comprise the different shapes. From top to bottom: straight lines, convex right angles and concave right angles. c) Example stimuli containing different configurations of the different factors, with each shape represented once.
  • ...and 4 more figures
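
The factor counts in the Figure 2 caption and the held-out-shape protocol in the Figure 3 caption can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the dataset is the full Cartesian grid over the stated factor values (the captions do not say whether duplicate images arising from shape symmetries are removed), and the factor names and the choice of held-out index are hypothetical, not the authors' identifiers.

```python
from itertools import product

# Factor cardinalities as stated in the Figure 2 caption.
# The key names are illustrative, not taken from the authors' code.
FACTOR_SIZES = {"shape": 12, "scale": 5, "rotation": 40, "pos_x": 20, "pos_y": 20}


def count_combinations(sizes):
    """Size of the full Cartesian grid over all factors (an assumption)."""
    total = 1
    for n in sizes.values():
        total *= n
    return total


def held_out_split_sizes(sizes, held_out_shape):
    """Count train/test combinations when one shape index is held out."""
    n_train = n_test = 0
    for combo in product(*(range(n) for n in sizes.values())):
        # combo[0] is the shape index because "shape" is the first key.
        if combo[0] == held_out_shape:
            n_test += 1
        else:
            n_train += 1
    return n_train, n_test


total = count_combinations(FACTOR_SIZES)
# Index 11 is an arbitrary stand-in for the held-out shape.
n_train, n_test = held_out_split_sizes(FACTOR_SIZES, held_out_shape=11)
print(total, n_train, n_test)
```

Under these assumptions the grid has 12 × 5 × 40 × 20 × 20 = 960,000 combinations, of which 80,000 belong to the single held-out shape and 880,000 remain for training on the other 11 shapes.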