Parallelized Spatiotemporal Binding

Gautam Singh, Yue Wang, Jiawei Yang, Boris Ivanovic, Sungjin Ahn, Marco Pavone, Tong Che

TL;DR

We introduce the Parallelizable Spatiotemporal Binder (PSB), the first temporally-parallelizable slot learning architecture for sequential inputs. PSB trains stably on longer sequences, achieves parallelization that results in a 60% increase in training speed, and performs on par with or better than the state of the art on unsupervised 2D and 3D object-centric scene decomposition and understanding.

Abstract

While modern best practices advocate for scalable architectures that support long-range interactions, object-centric models have yet to fully embrace these architectures. In particular, existing object-centric models for sequential inputs rely on RNN-based implementations and therefore show poor stability and capacity and are slow to train on long sequences. We introduce the Parallelizable Spatiotemporal Binder (PSB), the first temporally-parallelizable slot learning architecture for sequential inputs. Unlike conventional RNN-based approaches, PSB produces object-centric representations, known as slots, for all time-steps in parallel. This is achieved by refining the initial slots across all time-steps through a fixed number of layers equipped with causal attention. By capitalizing on the parallelism induced by our architecture, the proposed model exhibits a significant boost in efficiency. In experiments, we test PSB extensively as an encoder within an auto-encoding framework paired with a wide variety of decoder options. Compared to the state of the art, our architecture demonstrates stable training on longer sequences, achieves parallelization that results in a 60% increase in training speed, and yields performance that is on par with or better than the state of the art on unsupervised 2D and 3D object-centric scene decomposition and understanding.
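To make the idea of refining slots for all time-steps at once more concrete, the following is a minimal NumPy sketch of causal attention over the time axis, applied through a fixed number of layers. It is an illustration under stated assumptions, not the PSB implementation: the function names `causal_temporal_attention` and `refine_slots` are hypothetical, there are no learned projections, each slot attends only to its own history, and the step that binds slots to the input features is omitted entirely.

```python
import numpy as np


def causal_temporal_attention(slots):
    """Causal self-attention over time, computed for all time-steps at once.

    slots: array of shape (T, K, D) with T time-steps, K slots, D features.
    Assumption (not from the paper): slot k at time t attends only to
    slot k at times <= t, with identity query/key/value projections.
    """
    T, K, D = slots.shape
    x = slots.transpose(1, 0, 2)                    # (K, T, D)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(D)  # (K, T, T)

    # Lower-triangular mask: time t may look at times 0..t only.
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)

    # Numerically stable softmax over the attended time axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return (weights @ x).transpose(1, 0, 2)         # back to (T, K, D)


def refine_slots(initial_slots, num_layers=3):
    """Refine slots through a fixed number of layers with residual updates.

    There is no loop over time-steps: every layer processes the whole
    sequence in one batched matrix product.
    """
    slots = initial_slots
    for _ in range(num_layers):
        slots = slots + causal_temporal_attention(slots)
    return slots


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    initial = rng.normal(size=(5, 3, 4))            # T=5, K=3, D=4
    print(refine_slots(initial).shape)
```

The only sequential loop in this sketch runs over layers, whose count is fixed, rather than over time-steps as in a recurrent encoder. The lower-triangular mask is what keeps the computation causal: changing the inputs at later time-steps leaves the refined slots at earlier time-steps unchanged.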

Paper Structure

This paper contains 50 sections, 12 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Conventional Spatiotemporal Binding versus Ours. Left: Conventional object-centric encoders summarize sequential sensory inputs into slots via recurrence, analogous to RNNs. Right: In contrast, our proposed object-centric encoder achieves this without recurrence, allowing it to be parallelized over the sequence length, similarly to transformers.
  • Figure 2: Unsupervised Object-Centric Learning on MOVi-A and MOVi-B using Spatial Broadcast Decoder. We compare our proposed encoder with the recurrence-based baseline encoder SAVi. Top-Left: Video-level FG-ARI score $(\uparrow)$. Top-Right: Reconstruction PSNR $(\uparrow)$. Bottom: Slot linear probing performance $(\uparrow)$. Reported are the $R^2$ score for continuous-valued object factors (position and color) and classification accuracy for categorical object factors (shape, size, and material). We observe that our encoder surpasses the recurrent baseline SAVi in terms of FG-ARI and PSNR, and does markedly better in linear-probing performance for complex factors such as the object shape.
  • Figure 3: Computational Drawbacks of RNN-based Object-Centric Learning. We compare our proposed encoder with the recurrent baseline SAVi. Top: We show validation loss curves (mean and standard deviation computed over 5 seeds) for training runs on MOVi-A and MOVi-B. $T_\text{train}$ denotes the length of each training episode. We note that as we increase the episode length from 6 to 12, SAVi becomes highly unstable while our model continues to train smoothly. Bottom: We report the time taken (in seconds) to perform one training step plotted as a function of the episode length. We observe a speed-up of about 1.6$\times$ over SAVi.
  • Figure 4: Object-Centric Learning on MOVi-A using Our Proposed Encoder. We visualize a given video and its reconstruction and decomposition into objects using the proposed model. We note that object identity is consistently maintained over time as evidenced by the segment colors across frames.
  • Figure 5: Unsupervised Video Segmentation on MOVi-C, D and E using Autoregressive Image-Transformer Decoder. We compare our model with STEVE, which is based on the recurrent encoder of SAVi. Left: Video-level FG-ARI score $(\uparrow)$. Right: Reconstruction PSNR $(\uparrow)$.
  • ...and 11 more figures