TRecViT: A Recurrent Video Transformer

Viorica Pătrăucean; Xu Owen He; Joseph Heyward; Chuhan Zhang; Mehdi S. M. Sajjadi; George-Cristian Muraru; Artem Zholus; Mahdi Karami; Ross Goroshin; Yutian Chen; Simon Osindero; João Carreira; Razvan Pascanu

TL;DR

TRecViT is causal and outperforms or matches the pure-attention model ViViT-L on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count.

Abstract

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with a dedicated block for each dimension: gated linear recurrent units (LRUs) mix information over time, self-attention layers mix over space, and MLPs mix over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with the pure-attention model ViViT-L on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.
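The time-space-channel factorisation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation: the gate form (a simplified sigmoid-gated linear recurrence standing in for the LRU), the single-head attention, the ordering of the three sub-blocks, the omission of normalisation layers, and all names (`temporal_gated_recurrence`, `spatial_self_attention`, `channel_mlp`, `factorised_block`, `init_params`) are assumptions made for illustration only. The input is assumed to be a tensor of shape `(T, N, D)`: frames, spatial tokens per frame, channels.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def temporal_gated_recurrence(x, w_gate):
    """Mix over time. x: (T, N, D); each spatial token is scanned independently.

    Simplified stand-in for a gated LRU (assumption):
        h_t = a_t * h_{t-1} + (1 - a_t) * x_t,  a_t = sigmoid(x_t @ w_gate)
    """
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        a = sigmoid(x[t] @ w_gate)      # (N, D) input-dependent decay gate
        h = a * h + (1.0 - a) * x[t]    # depends only on frames <= t
        out[t] = h
    return out


def spatial_self_attention(x, w_q, w_k, w_v):
    """Mix over space. Single-head attention among the N tokens of each frame."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                          # (T, N, D) each
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])   # (T, N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)         # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def channel_mlp(x, w1, w2):
    """Mix over channels. Two-layer MLP applied to every token separately."""
    return np.maximum(x @ w1, 0.0) @ w2


def factorised_block(x, p):
    """One block: time, then space, then channels, each with a residual.

    The ordering and the missing normalisation layers are assumptions.
    """
    x = x + temporal_gated_recurrence(x, p["w_gate"])
    x = x + spatial_self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    x = x + channel_mlp(x, p["w1"], p["w2"])
    return x


def init_params(d, hidden, seed=0):
    """Random weights for the sketch (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d)
    return {
        "w_gate": rng.normal(size=(d, d)) * scale,
        "w_q": rng.normal(size=(d, d)) * scale,
        "w_k": rng.normal(size=(d, d)) * scale,
        "w_v": rng.normal(size=(d, d)) * scale,
        "w1": rng.normal(size=(d, hidden)) * scale,
        "w2": rng.normal(size=(hidden, d)) / np.sqrt(hidden),
    }


params = init_params(d=8, hidden=16)
video_tokens = np.random.default_rng(1).normal(size=(4, 5, 8))  # (T, N, D)
output = factorised_block(video_tokens, params)
```

In this sketch, time is the only axis along which information crosses frames, and the recurrence reads only the current and earlier frames, so the output at frame `t` does not depend on later frames. That is the sense in which such a block is causal. Attention cost here grows with the number of spatial tokens per frame rather than with the full space-time token count.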
