On the Benefits of Instance Decomposition in Video Prediction Models
Eliyas Suleyman, Paul Henderson, Nicolas Pugeault
TL;DR
This work tackles the challenge of predicting future video frames by explicitly decomposing a dynamic scene into object-centric latent slots within a latent-transformer framework. It introduces an object-aware autoencoder (OAAE) with class-specific encoders and codebooks and couples it to a multi-object transformer that uses instance-level self- and cross-attention to model both individual dynamics and inter-object interactions. Controlled experiments across synthetic and real datasets show that object decomposition, coupled with cross-attention, yields higher-quality predictions at comparable capacity, especially in scenes with strong interactions. The approach demonstrates the practical benefits of object-centric modeling for scalable video prediction and provides design guidance for future latent-transformer architectures.
Abstract
Video prediction is a crucial task for intelligent agents such as robots and autonomous vehicles, since it enables them to anticipate and act early on time-critical incidents. State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects. Such joint modeling is challenging and potentially sub-optimal, as every object in a dynamic scene has its own pattern of movement, typically somewhat independent of the others. In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models. We conduct detailed and carefully controlled experiments on both synthetic and real-world datasets; our results show that decomposing a dynamic scene leads to higher-quality predictions than models of a similar capacity that lack such decomposition.
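
To make the idea of modeling objects separately more concrete, the following is a minimal sketch of one attention block in which each object's tokens first attend to themselves (individual dynamics) and then to the tokens of all other objects (interactions). It is an illustration under stated assumptions, not the authors' implementation: the names `attend` and `multi_object_block` are hypothetical, the attention uses identity projections instead of learned weights, and there is no temporal dimension, masking, or codebook lookup.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(queries, context):
    """Scaled dot-product attention with identity projections.

    Assumption: a real model would apply learned query/key/value
    projections; here the context tokens serve as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, context)) for j in range(d)])
    return out


def multi_object_block(objects):
    """objects: dict mapping an object name to its list of token vectors."""
    # Step 1: self-attention within each object (per-object dynamics).
    selfed = {name: attend(toks, toks) for name, toks in objects.items()}

    # Step 2: cross-attention from each object to all other objects.
    result = {}
    for name, toks in selfed.items():
        others = [t for other, ts in selfed.items() if other != name for t in ts]
        if not others:
            # A single-object scene has nothing to interact with.
            result[name] = toks
            continue
        ctx = attend(toks, others)
        # Residual connection: keep the object's own state, add interaction context.
        result[name] = [[a + b for a, b in zip(x, c)] for x, c in zip(toks, ctx)]
    return result


scene = {"ball": [[1.0, 0.0]], "wall": [[0.0, 1.0]]}
updated = multi_object_block(scene)
```

In this toy scene each object holds one token, so after the block both `updated["ball"]` and `updated["wall"]` combine the object's own token with the other object's token. Dropping step 2 would leave each object unaware of the others, which is the kind of ablation a comparison of decomposition with and without cross-attention would need.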
