Moving Off-the-Grid: Scene-Grounded Video Representations

Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew A. Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, Thomas Kipf

TL;DR

This work presents Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative to grid-bound representations: its tokens can move "off-the-grid", which lets them represent scene elements consistently even as those elements move across the image plane through time.

Abstract

Current vision models typically maintain a fixed correspondence between their representation structure and image space. Each layer comprises a set of tokens arranged "on-the-grid," which biases patches or tokens to encode information at a specific spatio(-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised video representation model that offers an alternative approach, allowing tokens to move "off-the-grid" to better enable them to represent scene elements consistently, even as they move across the image plane through time. By using a combination of cross-attention and positional embeddings, we disentangle the representation structure from the image structure. We find that a simple self-supervised objective (next-frame prediction), trained on video data, results in a set of latent tokens that bind to specific scene structures and track them as they move. We demonstrate the usefulness of MooG's learned representation both qualitatively and quantitatively by training readouts on top of it for a variety of downstream tasks. We show that MooG can provide a strong foundation for different vision tasks when compared to "on-the-grid" baselines.

Paper Structure

This paper contains 53 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: MooG is a recurrent, transformer-based video representation model that can be unrolled through time. MooG learns a set of "off-the-grid" latent representations. The model first computes a predicted state from the previous model state and observation. The current observation is then encoded and cross-attended to, using the predicted state as queries, to produce a correction to the prediction. During training, the predicted state is decoded with cross-attention, using pixel coordinates as queries, to reconstruct the current frame. The corrected state is then fed to the predictor to produce the prediction for the next time step, and so on. The model is trained to minimize the pixel prediction error. By decoupling the latent structure from the image grid structure, the model is able to learn tokens that track scene content through time.
  • Figure 2: For comparison with MooG, we depict a classic "on-the-grid" model in which tokens in the latent state are inherently tied to specific pixel locations.
  • Figure 3: Readout decoders overview: for grid-based readouts (e.g. pixels), we use a simple per-frame cross-attention architecture with spatial coordinates as queries, whereas for set-based readouts (points, boxes), we adopt a recurrent readout architecture.
  • Figure 4: Qualitative analysis of MooG trained on natural videos; shown is every 4th frame of the original 36-frame sequence. From top to bottom: ground-truth frames, predicted frames, an example MooG token attention map superimposed on the ground-truth frames, and an example token attention map from the recurrent grid-based baseline (see text for details). The model predicts the next frame well, with blurring when there is fast motion or when unknown elements enter the scene. The MooG attention map indicates that the visualized token tracks the scene element it binds to across the full range of motion. In contrast, the grid-based token attention map shows that these tokens end up associated with a specific image location and do not track the scene content. Please see the supplementary material (and website) for other representative examples.
  • Figure 5: PCA of MooG tokens unrolled over a batch of short sequences. The model was unrolled over a batch of 24 sequences of 12 frames each. Predicted states from all time steps and batch samples were concatenated, and PCA was performed on the entire set jointly. We then reshape the projected set back to its original shape and use the arg-max token to visualize the result in image space (see text and Appendix for full details). Depicted are 3 of the leading PCA components in RGB. Note the salient high-level scene structure (e.g., hands) learned by the model.
  • ...and 2 more figures
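
The predict, decode, and correct cycle described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the single-head attention, the residual `tanh` predictor standing in for a transformer, the per-pixel linear encoder, the Fourier-style `coord_embed`, and all parameter names and sizes are assumptions chosen for brevity, and the weights are random rather than trained. It only aims to show how a fixed set of latent tokens can be decoded with pixel-coordinate queries and corrected by cross-attending to the encoded frame.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def coord_embed(h, w, dim):
    """Hypothetical Fourier-style embedding of normalized pixel coordinates."""
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1)   # (h*w, 2)
    freqs = 2.0 ** np.arange(dim // 4)                      # (dim/4,)
    angles = coords[:, :, None] * freqs * np.pi             # (h*w, 2, dim/4)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(h * w, dim)                          # (h*w, dim)

def unroll(frames, params, state):
    """Unroll the predict -> decode -> correct cycle over a sequence of frames."""
    h, w, _ = frames[0].shape
    pos = coord_embed(h, w, params["w_pred"].shape[0])
    losses = []
    for frame in frames:
        pixels = frame.reshape(-1, 3)
        # Predictor: stand-in for a transformer acting on the previous state.
        predicted = state + np.tanh(state @ params["w_pred"])
        # Decoder: pixel-coordinate embeddings query the predicted tokens.
        decoded = cross_attend(pos, predicted, predicted) @ params["w_out"]
        losses.append(np.mean((decoded - pixels) ** 2))     # pixel prediction error
        # Encoder: per-pixel features tagged with their grid position.
        features = pixels @ params["w_enc"] + pos
        # Corrector: predicted tokens query the encoded observation.
        state = predicted + cross_attend(predicted, features, features)
    return state, np.array(losses)

num_tokens, dim = 12, 16  # assumed sizes; dim must be divisible by 4
params = {
    "w_pred": rng.normal(scale=0.1, size=(dim, dim)),
    "w_enc": rng.normal(scale=0.1, size=(3, dim)),
    "w_out": rng.normal(scale=0.1, size=(dim, 3)),
}
frames = rng.random((5, 8, 8, 3))              # 5 random 8x8 RGB frames
state0 = rng.normal(size=(num_tokens, dim))    # assumed random initial state
final_state, losses = unroll(frames, params, state0)
print(final_state.shape, losses.shape)         # (12, 16) (5,)
```

Because both the decoder and the corrector use cross-attention, the number of latent tokens (12 here) is independent of the frame resolution: the same `unroll` call works on frames of a different size without changing the state shape. In a trained model, the mean of `losses` would be the quantity minimized.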