Learning Object Permanence from Videos via Latent Imaginations

Manuel Traub, Frederic Becker, Sebastian Otte, Martin V. Butz

TL;DR

The paper tackles the lack of learned object permanence in deep models by introducing Loci-Looped, a slot-based autoregressive model with an inner latent imagination loop and a percept gate that adaptively fuses internal predictions with visual observations. Trained end-to-end without supervision, it learns to track objects through occlusion, anticipate reappearance, handle sensory interruptions, and imagine long sequences, outperforming strong baselines on occlusion tracking, VoE-like tests, and robustness to missing data. The main contributions are an interpretable, self-supervised object-centric world model with latent imaginations that yield emergent object permanence, directional inertia, and solidity, plus a comprehensive experimental validation across multiple paradigms. This advances practical, human-like scene understanding by enabling robust reasoning about hidden objects directly from video data.

Abstract

While human infants exhibit knowledge about object permanence from two months of age onwards, deep-learning approaches still largely fail to recognize objects' continued existence. We introduce a slot-based autoregressive deep learning system, the looped location and identity tracking model Loci-Looped, which learns to adaptively fuse latent imaginations with pixel-space observations into consistent latent object-specific what and where encodings over time. The novel loop empowers Loci-Looped to learn the physical concepts of object permanence, directional inertia, and object solidity through observation alone. As a result, Loci-Looped tracks objects through occlusions, anticipates their reappearance, and shows signs of surprise and internal revisions when observing implausible object behavior. Notably, Loci-Looped outperforms state-of-the-art baseline models in handling object occlusions and temporary sensory interruptions while exhibiting more compositional, interpretable internal activity patterns. Our work thus introduces the first self-supervised interpretable learning model that learns about object permanence directly from video data without supervision.
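
The adaptive fusion of latent imaginations with observations described above can be illustrated with a minimal sketch. It assumes the percept gate acts as a scalar weight in [0, 1] that forms a convex combination of an observed and an imagined latent code for one slot. The function name `fuse_latents`, the scalar gate, and the list-based latent vectors are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def fuse_latents(observed: Sequence[float],
                 imagined: Sequence[float],
                 gate: float) -> List[float]:
    """Blend an observed and an imagined latent code for one slot.

    Assumption: gate = 1.0 trusts the observation fully (object visible),
    gate = 0.0 relies on the internal prediction only (object occluded
    or sensory input missing).
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(observed) != len(imagined):
        raise ValueError("latent codes must have the same length")
    # Element-wise convex combination of the two latent codes.
    return [gate * o + (1.0 - gate) * i for o, i in zip(observed, imagined)]


# Hypothetical two-dimensional latent codes for a single slot.
visible = fuse_latents([1.0, 2.0], [3.0, 4.0], gate=1.0)   # observation only
occluded = fuse_latents([1.0, 2.0], [3.0, 4.0], gate=0.0)  # imagination only
mixed = fuse_latents([1.0, 2.0], [3.0, 4.0], gate=0.5)     # equal blend
```

In this toy setting, a gate near zero during occlusion keeps the slot's state driven by the model's own prediction, which is one way to read the claim that the model tracks objects it cannot currently see.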

Paper Structure (52 sections, 13 equations, 17 figures, 6 tables)

Figures (17)

  • Figure 1: The object and visibility mask enable an interpretable holistic scene understanding in Loci-Looped. From left to right: Current video frame, reconstructed RGB object, object mask and visibility mask of slot $k$ depicting the blue object.
  • Figure 2: The slot-wise processing architecture of Loci-Looped. Predictions are made available on two routes. First, through an outer loop in pixel-space, which enables continuous visual object tracking over time; second, through an inner loop, which enables the generation of latent temporal imaginations.
  • Figure 3: Loci-Looped maintains stable object percepts of the occluded objects. Control Condition: Two objects traverse the scene and both objects reappear. Surprise Condition: Two objects traverse the scene, the blue object reappears while the green object vanishes. Next-frame Imagination: The model's imagination of how the scene unfolds behind the occluder, generated by applying layer summation without the occluder slot. The colored dots show the ground-truth (GT) positions of the objects.
  • Figure 4: Results on the VoE experiment. Surprise is quantified as the maximum slot error in the corresponding frame interval.
  • Figure 5: Left: Sensory interruptions experiment (CLEVRER dataset). Right: Imagination experiment (bouncing balls dataset). The trajectory of the first 20 predictions is shown.
  • ...and 12 more figures
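
The surprise measure named in the Figure 4 caption (the maximum slot error in a frame interval) can be sketched as follows. The sketch assumes that per-frame, per-slot errors are already available as a nested list and that the maximum is taken over both slots and frames in a half-open interval. The function name `surprise` and the data layout are hypothetical, and how the paper computes the slot errors themselves is not specified here.

```python
from typing import Sequence


def surprise(slot_errors: Sequence[Sequence[float]], start: int, end: int) -> float:
    """Return the largest slot error over frames start..end-1.

    slot_errors[t][k] is assumed to hold the error of slot k at frame t.
    """
    if not 0 <= start < end <= len(slot_errors):
        raise ValueError("empty or out-of-range frame interval")
    # Maximum over frames of the per-frame maximum over slots.
    return max(max(frame) for frame in slot_errors[start:end])


# Hypothetical errors for three frames and two slots.
errors = [[0.1, 0.2], [0.05, 0.9], [0.3, 0.1]]
early = surprise(errors, 0, 2)  # 0.9, driven by slot 1 at frame 1
late = surprise(errors, 2, 3)   # 0.3
```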