Next state prediction gives rise to entangled, yet compositional representations of objects

Tankred Saanum, Luca M. Schulze Buschoff, Peter Dayan, Eric Schulz

TL;DR

This paper examines whether distributed models can develop linearly separable representations of objects through unsupervised training on videos of object interactions and finds that, surprisingly, models with distributed representations often match or outperform models with object slots in downstream prediction tasks.

Abstract

Compositional representations are thought to enable humans to generalize across combinatorially vast state spaces. Models with learnable object slots, which encode information about objects in separate latent codes, have shown promise for this type of generalization but rely on strong architectural priors. Models with distributed representations, on the other hand, use overlapping, potentially entangled neural codes, and their ability to support compositional generalization remains underexplored. In this paper we examine whether distributed models can develop linearly separable representations of objects, like slotted models, through unsupervised training on videos of object interactions. We show that, surprisingly, models with distributed representations often match or outperform models with object slots in downstream prediction tasks. Furthermore, we find that linearly separable object representations can emerge without object-centric priors, with auxiliary objectives like next-state prediction playing a key role. Finally, we observe that distributed models' object representations are never fully disentangled, even if they are linearly separable: Multiple objects can be encoded through partially overlapping neural populations while still being highly separable with a linear classifier. We hypothesize that maintaining partially shared codes enables distributed models to better compress object dynamics, potentially enhancing generalization.

Paper Structure (21 sections, 5 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Overview of the decoding analysis and datasets. A: We propose a simple test for assessing compositional object representations. After unsupervised pre-training on object videos, we evaluate the linear separability of models' latent object representations. This is done by training a linear classifier on the absolute differences of two successive encoded frames where only one object changes. B: We evaluate the models on five datasets of dynamically interacting objects, ranging from simple depictions of blocks and sprites to realistic simulations of 3D objects.
  • Figure 2: Prediction accuracies for slotted and non-slotted contrastive dynamics models. In all five datasets we see that the CWM is not only competitive, but sometimes outperforms the CSWM when it comes to predicting object dynamics. Scores are averaged over five seeds, with error bars depicting the standard error of the mean.
  • Figure 3: Object decoding accuracy as a function of training set size, for contrastive models. CWM representations of objects become more linearly separable with dataset size, despite no architectural components that encourage the formation of object-centric representations. However, contrastive learning without next step prediction (CRL) does not give rise to object-centric representations, suggesting an important role for information provided by dynamic data. Scores are averaged over five seeds (three seeds in the MOVi domains), with error bars depicting standard error of the mean.
  • Figure 4: Object decoding accuracy as a function of training set size, for auto-encoding models. The dynamic training scheme yields a monotonic increase in object separability with training set size in four out of five datasets. Scores are averaged over five seeds (three seeds in MOVi domains), with error bars depicting standard error of the mean.
  • Figure 5: Reconstructions and LPIPS similarity for different models on the MOVi (simple) and MOVi-A datasets. Auto-encoding models without object slots approach or match the reconstruction ability of Slot Attention on novel object configurations in the MOVi domain.
  • ...and 4 more figures
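
The decoding test described in the Figure 1 caption can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' code: the random linear `mixing` matrix stands in for a trained encoder, each object has a single scalar state, and the linear probe is fit by least squares rather than by whatever classifier the paper uses. Only the overall recipe comes from the caption: encode two successive frames in which one object changes, take the absolute difference of the encodings, and ask a linear classifier which object changed. Accuracy is then averaged over five seeds with the standard error of the mean, as in the figure captions.

```python
import numpy as np

N_OBJECTS, LATENT_DIM, N_PAIRS = 3, 12, 600


def probe_accuracy(seed):
    """Held-out accuracy of a linear probe that decodes which object changed."""
    rng = np.random.default_rng(seed)

    # Hypothetical stand-in for a trained encoder: every latent unit mixes
    # all objects' states, so the code is distributed rather than slotted.
    mixing = rng.normal(size=(N_OBJECTS, LATENT_DIM))

    def encode(states):
        # (n, N_OBJECTS) object states -> (n, LATENT_DIM) noisy latents
        noise = 0.01 * rng.normal(size=(states.shape[0], LATENT_DIM))
        return states @ mixing + noise

    # Frame pairs in which exactly one object changes between t and t+1.
    states_t = rng.normal(size=(N_PAIRS, N_OBJECTS))
    changed = rng.integers(0, N_OBJECTS, size=N_PAIRS)  # label: object index
    delta = rng.uniform(0.5, 1.5, size=N_PAIRS) * rng.choice([-1.0, 1.0], size=N_PAIRS)
    states_t1 = states_t.copy()
    states_t1[np.arange(N_PAIRS), changed] += delta

    # Probe input: absolute difference of the two successive encodings.
    features = np.abs(encode(states_t1) - encode(states_t))

    # Linear probe: least-squares fit onto one-hot labels (no bias term).
    split = N_PAIRS // 2
    one_hot = np.eye(N_OBJECTS)[changed]
    weights, *_ = np.linalg.lstsq(features[:split], one_hot[:split], rcond=None)
    predicted = (features[split:] @ weights).argmax(axis=1)
    return float((predicted == changed[split:]).mean())


# Average over five seeds and report the standard error of the mean.
accuracies = np.array([probe_accuracy(seed) for seed in range(5)])
mean_accuracy = float(accuracies.mean())
sem = float(accuracies.std(ddof=1) / np.sqrt(len(accuracies)))
print(f"decoding accuracy: {mean_accuracy:.3f} +/- {sem:.3f} (chance = {1 / N_OBJECTS:.3f})")
```

In this toy setting every latent unit carries weight for every object, so the code is fully overlapping, yet the probe can still tell the objects apart because each object's change moves the latents along a distinct direction. This is only meant to make the abstract's distinction between "disentangled" and "linearly separable" concrete; it says nothing about how the trained models in the paper behave.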