Learning Object Permanence from Videos via Latent Imaginations

Manuel Traub; Frederic Becker; Sebastian Otte; Martin V. Butz

TL;DR

The paper tackles the lack of learned object permanence in deep models by introducing Loci-Looped, a slot-based autoregressive model with an inner latent imagination loop and a percept gate that adaptively fuses internal predictions with visual observations. Trained end-to-end without supervision, it learns to track objects through occlusion, anticipate reappearance, handle sensory interruptions, and imagine long sequences, outperforming strong baselines on occlusion tracking, VoE-like tests, and robustness to missing data. The main contributions are an interpretable, self-supervised object-centric world model with latent imaginations that yield emergent object permanence, directional inertia, and solidity, plus a comprehensive experimental validation across multiple paradigms. This advances practical, human-like scene understanding by enabling robust reasoning about hidden objects directly from video data.

Abstract

While human infants exhibit knowledge about object permanence from two months of age onwards, deep-learning approaches still largely fail to recognize objects' continued existence. We introduce a slot-based autoregressive deep learning system, the looped location and identity tracking model Loci-Looped, which learns to adaptively fuse latent imaginations with pixel-space observations into consistent latent object-specific what and where encodings over time. The novel loop empowers Loci-Looped to learn the physical concepts of object permanence, directional inertia, and object solidity through observation alone. As a result, Loci-Looped tracks objects through occlusions, anticipates their reappearance, and shows signs of surprise and internal revisions when observing implausible object behavior. Notably, Loci-Looped outperforms state-of-the-art baseline models in handling object occlusions and temporary sensory interruptions while exhibiting more compositional, interpretable internal activity patterns. Our work thus introduces the first self-supervised, interpretable learning model that learns about object permanence directly from video data.
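The adaptive fusion of imagined and observed encodings can be pictured as a gated interpolation between two latent codes. The following is a minimal sketch in plain Python, not the authors' implementation: the function name `fuse_latents`, the scalar gate `alpha` in [0, 1], and the convex-combination form are all assumptions made for illustration.

```python
from typing import List, Sequence


def fuse_latents(observed: Sequence[float],
                 imagined: Sequence[float],
                 alpha: float) -> List[float]:
    """Blend an observation-driven latent with an imagined (predicted) latent.

    Assumption: a scalar gate alpha in [0, 1] interpolates linearly.
      alpha = 1.0 -> rely on the current observation only
      alpha = 0.0 -> rely on the internal prediction only (e.g. under occlusion)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(observed) != len(imagined):
        raise ValueError("latent vectors must have the same length")
    return [alpha * o + (1.0 - alpha) * p for o, p in zip(observed, imagined)]


# Hypothetical 2-D position code for a single slot.
visible = fuse_latents([1.0, 2.0], [3.0, 4.0], alpha=1.0)   # gate fully open
occluded = fuse_latents([1.0, 2.0], [3.0, 4.0], alpha=0.0)  # gate fully closed
half = fuse_latents([1.0, 2.0], [3.0, 4.0], alpha=0.5)      # equal mix
```

With the gate closed, the slot's state is carried forward purely by its own prediction, which is one way a model could keep representing an object that is temporarily hidden.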

Paper Structure (52 sections, 13 equations, 17 figures, 6 tables)

Introduction
Related Work
Method
Loci-v1
Loci-Looped
Object Mask
Occlusion State
Percept Gate
Training
Loss functions
Experiments and Results
Baselines
Tracking Objects through Occlusion
Training set
Test set
...and 37 more sections

Figures (17)

Figure 1: The object and visibility masks enable an interpretable, holistic scene understanding in Loci-Looped. From left to right: current video frame, reconstructed RGB object, object mask, and visibility mask of slot $k$, which depicts the blue object.
Figure 2: The slot-wise processing architecture of Loci-Looped. Predictions are made available on two routes. First, through an outer loop in pixel-space, which enables continuous visual object tracking over time; second, through an inner loop, which enables the generation of latent temporal imaginations.
Figure 3: Loci-Looped maintains stable object percepts of the occluded objects. Control Condition: Two objects traverse the scene and both objects reappear. Surprise Condition: Two objects traverse the scene; the blue object reappears while the green object vanishes. Next-frame Imagination: The model's imagination of how the scene unfolds behind the occluder, generated by applying layer summation without the occluder slot. The colored dots show the ground-truth (GT) positions of the objects.
Figure 4: Results on the VoE experiment. Surprise is quantified as the maximum slot error in the corresponding frame interval.
Figure 5: Left: Sensory interruption experiment (CLEVRER dataset). Right: Imagination experiment (bouncing balls dataset). The trajectory of the first 20 predictions is shown.
...and 12 more figures
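
The surprise measure named in the Figure 4 caption (maximum slot error within a frame interval) can be sketched as follows. This is an illustrative sketch only: the function name `surprise`, the nested-list layout `slot_errors[t][k]`, the half-open interval convention, and the example values are assumptions, and how the per-slot error itself is computed is not specified in this excerpt.

```python
from typing import Sequence


def surprise(slot_errors: Sequence[Sequence[float]], start: int, end: int) -> float:
    """Return the largest per-slot error within frames [start, end).

    slot_errors[t][k] is assumed to hold the error of slot k at frame t.
    """
    if not 0 <= start < end <= len(slot_errors):
        raise ValueError("frame interval is out of range")
    # Maximum over slots within each frame, then maximum over frames.
    return max(max(frame) for frame in slot_errors[start:end])


# Made-up errors for 3 frames and 2 slots.
errors = [
    [0.1, 0.2],
    [0.9, 0.1],
    [0.3, 0.4],
]
whole_clip = surprise(errors, 0, 3)
last_frame = surprise(errors, 2, 3)
```

Taking the maximum rather than the mean means a single slot that is strongly violated in a single frame dominates the score.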
