Successes and Limitations of Object-centric Models at Compositional Generalisation
Milton L. Montero, Jeffrey S. Bowers, Gaurav Malhotra
TL;DR
The paper addresses the challenge of compositional generalisation in vision, arguing that disentangled VAEs fall short and that object-centric approaches may generalise better. It evaluates Slot Attention on object-level composition tasks and introduces a Pentomino-based dataset together with a simplified FgSeg variant to isolate local-feature effects. Results show that object-centric representations can achieve recombination-to-range generalisation and, with careful data design, extrapolate to unseen shapes; per-object reconstructions and partial ablations illustrate the mechanisms and limits. The study also highlights that, despite these improvements, latent representations may not be fully abstract and depend on data diversity, motivating future work on Gestalt-inspired priors and broader model comparisons. Overall, the findings suggest that object-centric models, when trained with strategic data curation, can robustly support compositional generalisation without language priors.
Abstract
In recent years, it has been shown empirically that standard disentangled latent variable models do not support robust compositional learning in the visual domain. Indeed, in spite of being designed with the goal of factorising datasets into their constituent factors of variation, disentangled models show extremely limited compositional generalisation capabilities. On the other hand, object-centric architectures have shown promising compositional skills, although 1) these have not been extensively tested and 2) experiments have been limited to scene composition, where models must generalise to novel combinations of objects in a visual scene instead of novel combinations of object properties. In this work, we show that these compositional generalisation skills extend to this latter setting. Furthermore, we present evidence pointing to the source of these skills and how they can be improved through careful training. Finally, we point to one important remaining limitation, which suggests new directions for research.
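To make "novel combinations of object properties" concrete, the following is a minimal sketch of how such a held-out split could be built. It is an illustration only, not the paper's data pipeline: the factor names and values (`SHAPES`, `COLOURS`) and the helper `held_out_combination_split` are hypothetical, and the split follows our reading of a recombination-to-range condition, in which one value of a factor is never seen together with a subset of values of another factor.

```python
from itertools import product

# Hypothetical generative factors; the paper's datasets use their own factors.
SHAPES = ["square", "ellipse", "heart"]
COLOURS = ["red", "green", "blue", "yellow"]


def held_out_combination_split(shapes, colours, held_shape, held_colours):
    """Split factor combinations so that `held_shape` never co-occurs with
    any colour in `held_colours` during training.

    Every individual factor value still appears in training; only the
    specific combinations are novel at test time.
    """
    held_colours = set(held_colours)
    train, test = [], []
    for shape, colour in product(shapes, colours):
        if shape == held_shape and colour in held_colours:
            test.append((shape, colour))   # combination withheld from training
        else:
            train.append((shape, colour))  # combination available for training
    return train, test


train, test = held_out_combination_split(SHAPES, COLOURS, "heart", ["red", "green"])
```

With these assumed factors, the test set contains only hearts in red or green, while hearts in other colours and red or green objects of other shapes remain in training. A model that generalises compositionally should handle the withheld combinations even though it has seen each property only in other pairings.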
