Reasoning-Enhanced Object-Centric Learning for Videos

Jian Li, Pu Ren, Yang Liu, Hao Sun

TL;DR

This work tackles object-centric video understanding by integrating a memory-buffered reasoning module, STATM, with slot-based perception. By combining a Memory buffer that stores past slots and a Slot-based Time-Space Transformer that performs cross-temporal and spatial attention, the approach enhances object segmentation, tracking, and prediction, while remaining computationally efficient through a CS fusion architecture. Experiments on MOVi datasets show substantial perceptual gains for SAVi/SAVi++, and on CLEVRER, STATM achieves leading predictive-VQA performance and robust long-horizon predictions, outperforming SlotFormer in key settings. The results underscore a tight perception-prediction loop for intuitive physics in artificial systems and point to future work in joint training and real-world evaluations.

Abstract

Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experimental results on various datasets indicate that the STATM module can significantly enhance the capabilities of multiple state-of-the-art object-centric learning models for video. Moreover, as a predictive model, the STATM module also performs well in downstream prediction and Visual Question Answering (VQA) tasks. We will release our code and data at https://github.com/intell-sci-comput/STATM.

Paper Structure (19 sections, 2 equations, 16 figures, 17 tables)


Figures (16)

  • Figure 1: Overview of the Slot-based Time-Space Transformer with Memory buffer (STATM) architecture. The model employs Slot Attention (Locatello et al., 2020) for perception, which updates the slot information using the slots predicted by the STATM predictor at the previous timestep together with the features extracted by the encoder. For the first frame, the initial slot information is obtained either from a Gaussian distribution or from a hints module. The updated slot information is then stored in a memory buffer for subsequent use by STATM, which performs reasoning by combining temporal cross-attention and spatial self-attention. Temporal and spatial attention can be integrated in various ways. STATM supports both single-step predictions and long-sequence rollouts: single-step predictions can be used by Slot Attention to update slot information, and long-sequence rollouts can be used for downstream tasks such as VQA. Both perceptual and predicted slot information can be passed to the decoder to obtain reconstructions and segmentation masks. The perception and prediction modules thus mutually enhance each other.
  • Figure 2: Spatiotemporal attention computation architectures. The green slots represent those employed for spatial attention computation, while the orange slots are indicative of those used for temporal attention computation.
  • Figure 3: Qualitative results of our model compared to SAVi and SAVi++ on the MOVi datasets. On the relatively simple datasets, our model is slightly better than SAVi/SAVi++. As the complexity of the datasets increases, the advantage of our model becomes more pronounced.
  • Figure 4: Qualitative results of our model compared to SAVi++. (a) When a new object appears, SAVi++ cannot recognize it, but our model correctly identifies it after 1-2 frames. (b) When an object reappears after being obscured, SAVi++ either assigns it to a different slot (color change) or fails to recognize it. In contrast, our model correctly identifies it.
  • Figure 5: Results of long-sequence prediction on CLEVRER. Beyond a certain time point, the predictions generated by SlotFormer clearly begin to deviate from the ground truth, exhibiting artifacts such as blurriness (orange boxes), incorrect dynamics (red boxes), and inaccurate colors (green boxes). In contrast, our model maintains good performance.
  • ...and 11 more figures
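
The reasoning loop described in the Figure 1 caption (store slots in a memory buffer, apply temporal cross-attention and spatial self-attention, then predict one step or roll out many) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The names `SlotMemoryBuffer`, `predict_next_slots`, and `rollout` are hypothetical, the buffer capacity is an assumed hyperparameter, learned projections, MLPs, and normalization layers are omitted, and the additive fusion of the two attention outputs is just one simple choice among the "various ways" the caption mentions. It also assumes that temporal attention lets each slot attend over its own history and that spatial attention operates among the slots of the latest step.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention; q: (..., Nq, D), k and v: (..., Nk, D)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


class SlotMemoryBuffer:
    """FIFO store of the most recent slot sets (the capacity is an assumption)."""

    def __init__(self, capacity):
        self.frames = deque(maxlen=capacity)

    def push(self, slots):
        self.frames.append(np.asarray(slots, dtype=np.float64))

    def stacked(self):
        # Shape (T, K, D): T buffered steps, K slots, D features per slot.
        return np.stack(list(self.frames), axis=0)


def predict_next_slots(buffer):
    """Single-step prediction from the buffered slots (no learned weights)."""
    history = buffer.stacked()                     # (T, K, D)
    current = history[-1]                          # (K, D)

    # Temporal cross-attention: each slot queries its own buffered history.
    per_slot_history = np.swapaxes(history, 0, 1)  # (K, T, D)
    temporal = attention(current[:, None, :], per_slot_history,
                         per_slot_history)[:, 0, :]  # (K, D)

    # Spatial self-attention: slots of the latest step attend to each other.
    spatial = attention(current, current, current)   # (K, D)

    # Hypothetical fusion: residual connection plus the averaged outputs.
    return current + 0.5 * (temporal + spatial)


def rollout(buffer, steps):
    """Autoregressive rollout: each prediction is pushed back into the buffer."""
    predictions = []
    for _ in range(steps):
        next_slots = predict_next_slots(buffer)
        buffer.push(next_slots)
        predictions.append(next_slots)
    return np.stack(predictions, axis=0)             # (steps, K, D)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    buf = SlotMemoryBuffer(capacity=4)
    for _ in range(6):                  # six perceived frames, three slots each
        buf.push(rng.normal(size=(3, 8)))
    print(rollout(buf, steps=5).shape)
```

In a full model, the single-step output of `predict_next_slots` would initialize Slot Attention for the next frame, while the output of `rollout` would feed a downstream task such as VQA. Because the deque has a fixed `maxlen`, older slot sets are dropped automatically as new ones arrive.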