VONet: Unsupervised Video Object Learning With Parallel U-Net Attention and Object-wise Sequential VAE

Haonan Yu, Wei Xu

TL;DR

VONet learns temporally consistent object-centric representations from video without supervision. It combines parallel U-Net-based attention with an object-wise sequential VAE and a transformer decoder to handle complex video scenes. The method achieves state-of-the-art FG-ARI and mIoU on five MOVI datasets, holding up across varying temporal dynamics and scene complexity. Parallel attention offers a scalable, efficient alternative to recurrent slot generation, and the paper also examines temporal priors through KL balancing and replay-based training.

Abstract

Unsupervised video object learning seeks to decompose video scenes into structural object representations without any supervision from depth, optical flow, or segmentation. We present VONet, an innovative approach that is inspired by MONet. While utilizing a U-Net architecture, VONet employs an efficient and effective parallel attention inference process, generating attention masks for all slots simultaneously. Additionally, to enhance the temporal consistency of each mask across consecutive video frames, VONet develops an object-wise sequential VAE framework. The integration of these innovative encoder-side techniques, in conjunction with an expressive transformer-based decoder, establishes VONet as the leading unsupervised method for object learning across five MOVI datasets, encompassing videos of diverse complexities. Code is available at https://github.com/hnyu/vonet.

Paper Structure

This paper contains 18 sections, 13 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Attention processes of MONet (a) and VONet (b) on a single image. Red arrows represent sequential operations ("scp" stands for the MONet scope), while blue arrows at the same horizontal level represent parallel operations. The dependency on the input image has been omitted for clarity.
  • Figure 2: Diagram of the parallel attention network. Except for the transformer and softmax operator, it is possible to parallelize the executions related to the U-Net components. The skip connections between the U-Net downsampling and upsampling layers have been omitted for clarity.
  • Figure 3: VONet's architecture. The dependency on the input image ${\mathbf{x}}_t$ has been omitted for clarity. * in the subscripts represents the collection of $K$ ($K=2$ here) slots in parallel.
  • Figure 4: Example video frames of the MOVI datasets. A, B, and C contain up to 10 objects per video, while D and E contain up to 23.
  • Figure 5: Comparison of the attention inference efficiencies.
  • ...and 7 more figures
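
The contrast drawn in Figures 1 and 2 between sequential and parallel mask generation can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random logits stand in for the U-Net outputs, the transformer that sits between the U-Net components in VONet is omitted, and the function names `sequential_scope_masks` and `parallel_softmax_masks` are hypothetical. The first function follows the MONet-style scope recursion, where each mask is carved out of whatever scope the previous steps left over; the second normalizes all slot logits at once with a softmax across slots.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sequential_scope_masks(step_logits):
    """MONet-style recursion over a scope ("scp" in Figure 1).

    step_logits: array of shape (K-1, H, W). In MONet each step's logits
    would come from a U-Net that sees the previous scope, which is what
    forces the steps to run one after another. Here they are precomputed
    stand-ins (an assumption made to keep the sketch self-contained).
    """
    scope = np.ones(step_logits.shape[1:])
    masks = []
    for logits in step_logits:
        alpha = sigmoid(logits)
        masks.append(scope * alpha)      # mask claimed by this slot
        scope = scope * (1.0 - alpha)    # what remains for later slots
    masks.append(scope)                  # last slot takes the remainder
    return np.stack(masks)


def parallel_softmax_masks(slot_logits):
    """All K masks at once: softmax across the slot axis.

    slot_logits: array of shape (K, H, W), one logit map per slot.
    """
    shifted = slot_logits - slot_logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
K, H, W = 4, 8, 8  # illustrative sizes, not taken from the paper
seq_masks = sequential_scope_masks(rng.normal(size=(K - 1, H, W)))
par_masks = parallel_softmax_masks(rng.normal(size=(K, H, W)))
```

Both functions return K masks that sum to one at every pixel. The difference is the dependency structure: the sequential version needs K-1 dependent steps, while the parallel version has no ordering among slots, so the per-slot logit maps can be computed simultaneously.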