Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang

TL;DR

The paper investigates the potential of State Space Models, embodied by Mamba, as a versatile and efficient alternative to Transformers for video understanding. It introduces the Video Mamba Suite, a collection of 14 Mamba-based modules deployed across four roles to tackle 12 video tasks, including video-text interactions, with extensive experiments on 13 datasets. Across temporal, cross-modal, and spatial-temporal settings, Mamba variants demonstrate competitive or superior performance and favorable efficiency trade-offs, highlighting linear-scaling advantages for long sequences. The work provides a comprehensive resource for future research in video understanding using SSMs and releases public code to facilitate further exploration.

Abstract

Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: https://github.com/OpenGVLab/video-mamba-suite.

Paper Structure (35 sections, 1 equation, 10 figures, 20 tables)

Figures (10)

  • Figure 1: We investigate SSMs exemplified by Mamba on video understanding. Our Video Mamba Suite comprises 14 SSM models/modules for 12 video understanding tasks. We explore 4 roles of SSM in video modeling and conduct extensive experiments on 13 major datasets.
  • Figure 2: Illustration of three SSM blocks. (a) is the vanilla Mamba block [gu2023mamba]. (b) is the ViM block [zhu2024vim]. (c) is our proposed DBM block, which separates the input projector and shares the SSM parameters across both scanning directions.
  • Figure 3: Illustration for different positions of video and text tokens in the input sequence.
  • Figure 4: Illustration of our explored structures. (a) and (b) show the vanilla-style [timesformer] and frozen-style [frozenintime] residual connection forms for TimeSformer [timesformer]. (c) and (d) present our TimeMamba, which uses the ViM block as a temporal module in both styles. (e) replaces the temporal ViM block with a space-time ViM block.
  • Figure 5: Results of using different numbers of testing frames for zero-shot QA on EgoSchema [mangalam2024egoschema]. The model is trained on Ego4D [ego4d] with 4 frames.
  • ...and 5 more figures