Snakes and Ladders: Two Steps Up for VideoMamba

Hui Lu, Albert Ali Salah, Ronald Poppe

TL;DR

This paper identifies two limitations in Mamba's token processing, historical decay and element contradiction, and proposes VideoMambaPro (VMP), which addresses both by adding masked backward computation and elemental residual connections to a VideoMamba backbone.

Abstract

Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative for transformers. However, Mamba's successes do not trivially extend to vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP) that solves the identified limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. Differently sized VideoMambaPro models surpass VideoMamba by 1.6-2.8% and 1.1-1.9% top-1 on Kinetics-400 and Something-Something V2, respectively. Even without extensive pre-training, our models present an increasingly attractive and efficient alternative to current transformer models. Moreover, our two solutions are orthogonal to recent advances in Vision Mamba models, and are likely to provide further improvements in future models.
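The abstract names two architectural fixes applied to the bi-directional block: a masked backward computation and elemental residual connections. The following is a minimal NumPy sketch of the masked-backward idea only, not the authors' code: it treats each scan direction as the triangular token-mixing matrix an SSM induces, and zeroes the diagonal of the backward matrix so a token's self-contribution is counted once rather than in both directions. All names and the stand-in mixing matrices are our own illustrative assumptions.

```python
import numpy as np

def token_mix_matrix(T, rng):
    """Illustrative stand-in for the causal token-mixing matrix a forward
    SSM scan induces: token i aggregates tokens 0..i, so the matrix is
    lower-triangular (like a causal attention map)."""
    return np.tril(rng.random((T, T)))

rng = np.random.default_rng(0)
T, D = 4, 8                  # 4 tokens, 8 feature dimensions
x = rng.random((T, D))       # toy token sequence

fwd = token_mix_matrix(T, rng)            # forward scan: lower-triangular
bwd = token_mix_matrix(T, rng).T          # backward scan: upper-triangular
bwd_masked = bwd - np.diag(np.diag(bwd))  # mask the diagonal so each token's
                                          # self-contribution appears only in
                                          # the forward direction

y = fwd @ x + bwd_masked @ x  # combined bi-directional mixing
print(y.shape)                # (4, 8): one mixed feature vector per token
```

Without the mask, the combined matrix `fwd + bwd` would weight each token's own features twice (once per scan direction); masking the backward diagonal removes that double counting while leaving all cross-token interactions intact.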


Paper Structure

This paper contains 17 sections, 27 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: (a) Framework of VideoMambaPro with $K$ bi-directional Mamba blocks. (b) In each bi-directional Mamba block, we employ a forward residual SSM and a masked backward residual SSM.
  • Figure 2: Relative accuracy per class on Kinetics-400 by comparing VideoMambaPro-M to a baseline VideoMamba-M. Classes sorted by relative performance.
  • Figure 3: Top-1 accuracy versus number of parameters of VideoMambaPro and other models on Kinetics-400.
  • Figure 4: Top-1 accuracy versus number of FLOPs of VideoMambaPro and other models on Kinetics-400.
  • Figure 5: Comparison between the bi-directional VideoMamba (top) and VideoMambaPro (bottom) blocks.