Video Prediction Models as General Visual Encoders
James Maier, Nishanth Mohankumar
TL;DR
The paper addresses how open-source video prediction models can serve as general visual encoders for downstream vision tasks, focusing on instance segmentation on the BAIR Robot Pushing dataset. It adopts a 3D VQVAE-based video encoder conditioned on a single input frame and evaluates its latent space with various segmentation heads, comparing against a U-Net baseline. Across ablations, the unfrozen 3D-ResNet encoder with a transformer-informed latent space and a convolutional decoder achieves IoU ≈ 0.83, demonstrating competitive performance and suggesting that motion-aware latent representations can support segmentation from single frames. The study also highlights the importance of model choice (VideoGPT vs. MAGVIT), the value of including temporal information, and the potential for scaling to larger datasets such as COCO to improve generalization. Overall, it provides evidence that generative pretext learning from video data can yield useful representations for downstream scene analysis tasks.
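The IoU figure quoted above is the intersection-over-union between a predicted and a ground-truth mask. As a minimal sketch of that metric for binary foreground-background masks, assuming masks are given as nested lists of 0/1 values (the function name binary_iou and the empty-mask convention are illustrative assumptions, not taken from the paper):

```python
def binary_iou(pred, target):
    """Intersection over Union for two same-shaped binary masks.

    Masks are nested lists (rows of 0/1 values). This layout is an
    assumption made for illustration; the paper does not specify one.
    """
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0  # pixels marked foreground in both masks
    union = 0         # pixels marked foreground in at least one mask
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            union += p or t

    # Assumed convention: two completely empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are foreground in either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(binary_iou(pred, target))  # 0.5
```

In this toy case the masks share 2 foreground pixels out of 4 in their union, so the score is 0.5; an IoU of about 0.83 means the overlap covers roughly 83% of the combined foreground area.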
Abstract
This study explores the potential of open-source conditional video generation models as encoders for downstream tasks, focusing on instance segmentation using the BAIR Robot Pushing Dataset. The researchers propose using video prediction models as general visual encoders, leveraging their ability to capture the spatial and temporal information that is essential for tasks such as instance segmentation. Inspired by human vision studies, particularly the Gestalt principle of common fate, the approach aims to develop a latent space that represents motion from images in order to effectively separate foreground from background. The researchers utilize a 3D Vector-Quantized Variational Autoencoder (3D VQVAE) video generative encoder model conditioned on an input frame, coupled with downstream segmentation tasks. Experiments involve adapting pre-trained video generative models, analyzing their latent spaces, and training custom decoders for foreground-background segmentation. The findings demonstrate promising results in leveraging generative pretext learning for downstream tasks, working towards enhanced scene analysis and segmentation in computer vision applications.
