TRecViT: A Recurrent Video Transformer

Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu

TL;DR

This model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count.

Abstract

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.

Paper Structure

This paper contains 20 sections, 1 equation, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Left: TRecViT architecture. Each video frame is divided into non-overlapping patches that are linearly projected into a token embedding space. We then add a learnt spatial positional encoding. The tokens are passed through gated linear recurrent units (LRUs) that share parameters across space. The outputs of the recurrent blocks are then processed by a ViT block. The recurrent operation followed by ViT is repeated N times. Right: TRecViT block. The input is a batch of videos, each frame with N tokens. We apply recurrent units over temporal tubes to integrate information over time, and self-attention and MLP across tokens within each frame. Note that the recurrent units share parameters, but the information is not mixed across temporal tubes. Similarly, the ViT blocks share parameters, but the information is not mixed across frames.
  • Figure 2: Distribution of the eigenvalues of the recurrent matrix at the beginning and end of training on the long video memorisation task, for different initialisation ranges.
  • Figure 3: Our model demonstrates increasingly greater memory and compute savings compared to ViViT baselines as the number of frames increases. For clarity, TRecViT's peak memory (left figure) goes from about 4G for 8 frames to 22.4G for 64 frames, but this increase is dwarfed by ViViT's increase, hence the TRecViT line appears almost horizontal.
  • Figure 4: TRecViT compared to baselines on supervised video classification on the SSv2 dataset, trained from scratch. The plot shows the evolution of the evaluation accuracy as training progresses.
  • Figure 5: Qualitative results obtained by TRecViT for point tracking on the DAVIS dataset compared to VideoMAE. The leftmost image indicates the point to track in the original frame, and the images towards the right show zoom-ins on subsequent frames. Green plus (+) markers indicate the ground truth, yellow circles indicate TRecViT's predictions and red circles indicate VideoMAE's predictions.
  • ...and 5 more figures
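
The time-space-channel factorisation described in the abstract and in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the recurrence below is a simplified diagonal linear recurrence with a fixed per-channel decay (the paper's gated LRU is richer), the attention is single-head, and layer normalisation and positional encodings are omitted. All function names, the `sqrt(1 - decay^2)` input scaling, the residual connections, and the initialisation are assumptions made for illustration only.

```python
import numpy as np


def init_params(rng, channels, hidden_mult=4):
    """Random weights for one block (illustrative initialisation, an assumption)."""
    hidden = hidden_mult * channels
    scale = 1.0 / np.sqrt(channels)
    return {
        # One decay per channel, shared across all spatial positions.
        "decay": rng.uniform(0.5, 0.99, size=channels),
        "wq": rng.normal(size=(channels, channels)) * scale,
        "wk": rng.normal(size=(channels, channels)) * scale,
        "wv": rng.normal(size=(channels, channels)) * scale,
        "w1": rng.normal(size=(channels, hidden)) * scale,
        "w2": rng.normal(size=(hidden, channels)) / np.sqrt(hidden),
    }


def temporal_recurrence(x, decay):
    """Causal mixing over time, independently per temporal tube.

    x has shape (batch, frames, tokens, channels). Each (batch, token) pair
    is its own sequence: h_t = decay * h_{t-1} + sqrt(1 - decay^2) * x_t.
    """
    batch, frames, tokens, channels = x.shape
    h = np.zeros((batch, tokens, channels))
    out = np.empty_like(x)
    gain = np.sqrt(1.0 - decay**2)
    for t in range(frames):
        h = decay * h + gain * x[:, t]
        out[:, t] = h
    return out


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def frame_self_attention(x, wq, wk, wv):
    """Single-head self-attention across tokens within each frame."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def channel_mlp(x, w1, w2):
    """Two-layer ReLU MLP applied to every token independently."""
    return np.maximum(x @ w1, 0.0) @ w2


def trecvit_block_sketch(video, params):
    """Time mixing, then space mixing, then channel mixing, with residuals."""
    x = video + temporal_recurrence(video, params["decay"])
    x = x + frame_self_attention(x, params["wq"], params["wk"], params["wv"])
    x = x + channel_mlp(x, params["w1"], params["w2"])
    return x
```

Because the only operation that looks along the time axis is the recurrence, and it only looks backwards, the output at frame `t` in this sketch depends on frames `0..t` alone, which is the sense in which the block is causal.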