VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting

Yujin Tang, Peijie Dong, Zhenheng Tang, Xiaowen Chu, Junwei Liang

TL;DR

This work tackles spatiotemporal forecasting, where modeling long-range global dependencies efficiently remains difficult. It introduces VMRNN, a recurrent cell that fuses Vision Mamba blocks with an LSTM and operates on patch-embedded frames. Two architectures, VMRNN-B (a base model with a single VMRNN cell) and VMRNN-D (a deep model with multiple VMRNN cells), vary depth and resolution via patch merging and expanding. Empirical results on Moving MNIST, TaxiBJ, and KTH show competitive accuracy with substantially fewer parameters and FLOPs than state-of-the-art methods, supporting the approach's efficiency. The work provides a new baseline for vision-based spatiotemporal forecasting and highlights the potential of Vision Mamba blocks in sequential video modeling.

Abstract

Combining CNNs or ViTs with RNNs for spatiotemporal forecasting has yielded unparalleled results in predicting temporal and spatial dynamics. However, modeling extensive global information remains a formidable challenge: CNNs are limited by their narrow receptive fields, and ViTs struggle with the intensive computational demands of their attention mechanisms. Recent Mamba-based architectures have been met with enthusiasm for their exceptional long-sequence modeling capabilities, surpassing established vision models in efficiency and accuracy, which motivates us to develop an innovative architecture tailored for spatiotemporal forecasting. In this paper, we propose the VMRNN cell, a new recurrent unit that integrates the strengths of Vision Mamba blocks with LSTM. We construct a network centered on VMRNN cells to tackle spatiotemporal prediction tasks effectively. Our extensive evaluations show that our proposed approach secures competitive results on a variety of tasks while maintaining a smaller model size. Our code is available at https://github.com/yyyujintang/VMRNN-PyTorch.
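
To make the idea of a recurrent unit over patch-embedded frames more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the gating is the textbook LSTM update rather than the paper's own equations, `token_mixer` is a hypothetical stand-in for the Vision Mamba blocks, and every name, weight, and shape below is an assumption chosen for illustration.

```python
import numpy as np

def patch_embed(frame, patch, weight):
    """Split a (H, W, C) frame into non-overlapping patches and project each to D dims."""
    h, w, c = frame.shape
    gh, gw = h // patch, w // patch
    patches = frame[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return patches @ weight  # (num_patches, D)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def token_mixer(tokens):
    """Hypothetical stand-in for the Vision Mamba blocks: adds the mean token to every token."""
    return tokens + tokens.mean(axis=0, keepdims=True)

def recurrent_step(x, h_prev, c_prev, w_fuse, w_gates):
    """One textbook LSTM-style step over patch tokens (assumed form, not the paper's equations)."""
    fused = np.concatenate([x, h_prev], axis=-1) @ w_fuse  # fuse input and hidden state: (N, D)
    mixed = token_mixer(fused)                             # global mixing across tokens
    i, f, o, g = np.split(mixed @ w_gates, 4, axis=-1)     # four gate pre-activations, each (N, D)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)      # cell state update
    h = sigmoid(o) * np.tanh(c)                            # hidden state update
    return h, c

# Illustrative sizes: 4 frames of 8x8x1, 2x2 patches, 16-dim tokens.
rng = np.random.default_rng(0)
T, H, W, C, P, D = 4, 8, 8, 1, 2, 16
frames = rng.random((T, H, W, C))
w_embed = rng.normal(scale=0.1, size=(P * P * C, D))
w_fuse = rng.normal(scale=0.1, size=(2 * D, D))
w_gates = rng.normal(scale=0.1, size=(D, 4 * D))

n_tokens = (H // P) * (W // P)
h = np.zeros((n_tokens, D))
c = np.zeros((n_tokens, D))
for t in range(T):
    x = patch_embed(frames[t], P, w_embed)
    h, c = recurrent_step(x, h, c, w_fuse, w_gates)
```

The hidden and cell states keep one token per patch, so the spatial layout is carried through time. Swapping `token_mixer` for a real sequence model is where a Mamba-style block would plug in.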

Paper Structure (16 sections, 3 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Performance comparison of spatiotemporal predictive learning methods on TaxiBJ. VMRNN outperforms previous methods in terms of mean squared error (MSE, lower is better) at a lower computational cost (GFLOPs).
  • Figure 2: (a): The detailed structure of the proposed recurrent cell, VMRNN. VSB and LP denote VSS Block and Linear Projection, respectively. (b): The architecture of the VSS Block. (c): The SS2D process, which includes three stages: Scan Expand, S6 Block, and Scan Merge.
  • Figure 3: (a): The architecture of the base model with a single VMRNN cell, VMRNN-B. (b): The architecture of the deep model with multiple VMRNN cells, VMRNN-D.
  • Figure 4: Qualitative results of VMRNN on KTH.
  • Figure 5: Ablation study on the different numbers of VSS Block with VMRNN on TaxiBJ.
  • ...and 2 more figures
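
The three SS2D stages named in the Figure 2 caption can be sketched as follows. This is a rough NumPy illustration under stated assumptions: the four scan directions (row-wise, column-wise, and their reverses) follow the general description of 2D selective scanning in the VMamba literature, and `placeholder_s6` is a hypothetical causal running mean standing in for the real S6 block. The function names are invented for this sketch.

```python
import numpy as np

def scan_expand(x):
    """Unfold a (H, W, D) feature map into four (H*W, D) sequences."""
    h, w, d = x.shape
    row_major = x.reshape(h * w, d)                     # left-to-right, top-to-bottom
    col_major = x.transpose(1, 0, 2).reshape(h * w, d)  # top-to-bottom, left-to-right
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def placeholder_s6(seq):
    """Hypothetical stand-in for S6: a causal running mean along the sequence."""
    steps = np.arange(1, seq.shape[0] + 1)[:, None]
    return np.cumsum(seq, axis=0) / steps

def scan_merge(seqs, h, w):
    """Undo each scan order and sum the four results into one (H, W, D) map."""
    d = seqs[0].shape[-1]
    row_f, col_f, row_b, col_b = seqs
    out = row_f.reshape(h, w, d)
    out = out + col_f.reshape(w, h, d).transpose(1, 0, 2)
    out = out + row_b[::-1].reshape(h, w, d)
    out = out + col_b[::-1].reshape(w, h, d).transpose(1, 0, 2)
    return out

def ss2d_sketch(x):
    """Scan Expand, per-sequence processing, then Scan Merge."""
    h, w, _ = x.shape
    return scan_merge([placeholder_s6(s) for s in scan_expand(x)], h, w)

x = np.arange(2 * 3 * 1, dtype=float).reshape(2, 3, 1)
roundtrip = scan_merge(scan_expand(x), 2, 3)  # no processing in between
y = ss2d_sketch(x)
```

With no processing between the two stages, `roundtrip` equals `4 * x`, which confirms that each of the four scan orders is inverted correctly before summation. Because each sequence is processed causally, the four directions together let every position receive information from the whole map.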