Reasoning-Enhanced Object-Centric Learning for Videos
Jian Li, Pu Ren, Yang Liu, Hao Sun
TL;DR
This work tackles object-centric video understanding by integrating a memory-buffered reasoning module, STATM, with slot-based perception. By combining a Memory buffer that stores past slots and a Slot-based Time-Space Transformer that performs cross-temporal and spatial attention, the approach enhances object segmentation, tracking, and prediction, while remaining computationally efficient through a CS fusion architecture. Experiments on MOVi datasets show substantial perceptual gains for SAVi/SAVi++, and on CLEVRER, STATM achieves leading predictive-VQA performance and robust long-horizon predictions, outperforming SlotFormer in key settings. The results underscore a tight perception-prediction loop for intuitive physics in artificial systems and point to future work in joint training and real-world evaluations.
Abstract
Object-centric learning aims to break down complex visual scenes into more manageable object representations, enhancing the understanding and reasoning abilities of machine learning systems toward the physical world. Recently, slot-based video models have demonstrated remarkable proficiency in segmenting and tracking objects, but they overlook the importance of an effective reasoning module. In the real world, reasoning and predictive abilities play a crucial role in human perception and object tracking; in particular, these abilities are closely related to human intuitive physics. Inspired by this, we designed a novel reasoning module called the Slot-based Time-Space Transformer with Memory buffer (STATM) to enhance the model's perception ability in complex scenes. The memory buffer primarily serves as storage for slot information from upstream modules, while the Slot-based Time-Space Transformer makes predictions through slot-based spatiotemporal attention computations and fusion. Our experimental results on various datasets indicate that the STATM module can significantly enhance the capabilities of multiple state-of-the-art object-centric learning models for video. Moreover, as a predictive model, the STATM module also performs well in downstream prediction and Visual Question Answering (VQA) tasks. We will release our code and data at https://github.com/intell-sci-comput/STATM.
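
To make the data flow described above concrete, the following is a minimal NumPy sketch of a memory buffer feeding slot-wise temporal and spatial attention. It is an illustration under stated assumptions, not the released implementation: the names `SlotMemory`, `attend`, and `predict_slots` are hypothetical, the attention has no learned projections, and the fusion is a plain sum chosen for simplicity, since the fusion rule is not spelled out in this section.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, keys, values):
    """Scaled dot-product attention without learned projections (assumption)."""
    scale = np.sqrt(query.shape[-1])
    weights = softmax(query @ np.swapaxes(keys, -1, -2) / scale, axis=-1)
    return weights @ values


class SlotMemory:
    """Hypothetical fixed-capacity buffer holding the most recent slot sets."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entry is dropped when full

    def push(self, slots):
        self.buffer.append(np.asarray(slots, dtype=float))

    def stacked(self):
        return np.stack(self.buffer, axis=0)  # shape (T, N, D)


def predict_slots(memory):
    """Fuse temporal and spatial attention over buffered slots.

    Temporal: each current slot attends to its own history across time.
    Spatial: the current slots attend to one another.
    Fusion: element-wise sum (an assumption made for this sketch).
    """
    history = memory.stacked()             # (T, N, D)
    current = history[-1]                  # (N, D)
    per_slot = np.swapaxes(history, 0, 1)  # (N, T, D)
    temporal = attend(current[:, None, :], per_slot, per_slot)[:, 0, :]
    spatial = attend(current, current, current)
    return temporal + spatial              # (N, D)
```

In a slot-based video model, the output of `predict_slots` would play the role of the predicted slots handed back to the perception module for the next frame; a trained module would add learned query, key, and value projections, normalization, and feed-forward layers on top of this skeleton.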
