SimVPv2: Towards Simple yet Powerful Spatiotemporal Predictive Learning

Cheng Tan, Zhangyang Gao, Siyuan Li, Stan Z. Li

TL;DR

SimVPv2 tackles spatiotemporal predictive learning by eliminating Unet-like multi-scale components and introducing a gated spatiotemporal attention (gSTA) mechanism within a fully CNN-based (CNN-CNN-CNN) architecture. The model achieves state-of-the-art performance with fewer parameters and FLOPs and with faster training and inference than prior methods, validated across eight diverse datasets including Moving MNIST, TaxiBJ, WeatherBench, and multi-domain benchmarks. Extensive experiments demonstrate strong generalization across domains, accurate long-horizon frame prediction, and efficiency gains, establishing SimVPv2 as a simple yet powerful baseline for spatiotemporal predictive learning. Overall, the work emphasizes that careful architectural simplification, coupled with gSTA, can outperform more complex recurrent or transformer-based approaches while offering practical benefits for real-world deployment.

Abstract

Recent years have witnessed remarkable advances in spatiotemporal predictive learning, with methods incorporating auxiliary inputs, complex neural architectures, and sophisticated training strategies. While SimVP introduced a simpler, CNN-based baseline for this task, it still relies on heavy Unet-like architectures for spatial and temporal modeling, which incur high complexity and computational overhead. In this paper, we propose SimVPv2, a streamlined model that eliminates the need for Unet architectures and demonstrates that plain stacks of convolutional layers, enhanced with an efficient Gated Spatiotemporal Attention mechanism, can deliver state-of-the-art performance. SimVPv2 not only simplifies the model architecture but also improves both performance and computational efficiency. On the standard Moving MNIST benchmark, SimVPv2 outperforms SimVP with fewer FLOPs, about half the training time, and 60% faster inference. Extensive experiments across eight diverse datasets, including real-world tasks such as traffic forecasting and climate prediction, further demonstrate that SimVPv2 offers a powerful yet straightforward solution, achieving robust generalization across various spatiotemporal learning scenarios. We believe the proposed SimVPv2 can serve as a solid baseline to benefit the spatiotemporal predictive learning community.
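
The gating idea behind the Gated Spatiotemporal Attention mechanism can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: it assumes that an attention branch has already produced 2C feature channels at one spatial location, that the first C channels act as values and the last C as gate logits, and that the output is the value scaled by the sigmoid of its gate. The function name `gated_attention` and the channel layout are hypothetical, and the convolutions that would produce these features are omitted.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a gate logit to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(features: list[float]) -> list[float]:
    """Apply a sigmoid gate to attention values at one spatial location.

    `features` holds 2*C numbers: the first C are treated as attention
    values and the last C as gate logits (an assumed layout).
    """
    if len(features) % 2 != 0:
        raise ValueError("expected an even number of channels (values + gates)")
    c = len(features) // 2
    values, gate_logits = features[:c], features[c:]
    # Each value channel is scaled by its own gate in (0, 1).
    return [sigmoid(g) * v for v, g in zip(values, gate_logits)]


# Two value channels (2.0, -4.0) with neutral gate logits (0.0, 0.0):
# sigmoid(0) = 0.5, so each value is halved.
print(gated_attention([2.0, -4.0, 0.0, 0.0]))
```

A strongly negative gate logit suppresses its channel almost entirely, while a strongly positive one passes the value through nearly unchanged, which is what lets the gate select informative features.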

Paper Structure (27 sections, 14 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Major categories of architectures for spatiotemporal predictive learning. The red and blue dotted lines mark where the temporal evolution and spatial dependency are learned. Our proposed SimVP and SimVPv2 belong to (d) CNN-CNN-CNN, which can outperform other state-of-the-art methods.
  • Figure 2: The overall framework of SimVP and SimVPv2.
  • Figure 3: Schematic diagram of the autoencoder and our proposed SimVP. While the autoencoder focuses on a single frame at a fixed time step, SimVP operates on a sequence of frames across time steps. The first row shows the ground-truth frames, and the second row shows the predicted frames. From left to right, the data changes over time.
  • Figure 4: The spatial encoder and decoder perform spatial feature extraction and reconstruction at the single-frame level. The translator learns temporal dependencies at the multi-frame level.
  • Figure 5: (a-b) The Inception temporal module and corresponding Inception-Unet translator. (c-d) The gSTA module and corresponding gSTA translator.
  • ...and 6 more figures
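
The split described in the Figure 4 caption, with the encoder and decoder working per frame and the translator working across frames, can be made concrete by tracking tensor shapes. The sketch below is only shape arithmetic under assumptions that are not stated in this section: frames are folded into the batch dimension for the encoder and decoder, frames are stacked along the channel dimension for the translator, the encoder downsamples by a fixed factor, and the number of predicted frames equals the number of input frames. The function name `simvp_shape_flow` and the default values for `hidden_channels` and `downsample` are hypothetical.

```python
def simvp_shape_flow(batch, frames, channels, height, width,
                     hidden_channels=64, downsample=4):
    """Return the assumed tensor shape at each stage of an
    encoder -> translator -> decoder pipeline.

    hidden_channels and downsample are illustrative defaults,
    not values taken from the paper.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by downsample")
    h, w = height // downsample, width // downsample
    return {
        # A clip of `frames` images per sample.
        "input": (batch, frames, channels, height, width),
        # Single-frame level: every frame is an independent encoder input.
        "encoder_in": (batch * frames, channels, height, width),
        "encoder_out": (batch * frames, hidden_channels, h, w),
        # Multi-frame level: frames are stacked along channels so that
        # 2D convolutions in the translator can mix information over time.
        "translator_in": (batch, frames * hidden_channels, h, w),
        "translator_out": (batch, frames * hidden_channels, h, w),
        # Back to single-frame level for reconstruction.
        "decoder_in": (batch * frames, hidden_channels, h, w),
        "prediction": (batch, frames, channels, height, width),
    }


# Example with a Moving MNIST-like clip: 10 grayscale frames of 64x64.
for stage, shape in simvp_shape_flow(16, 10, 1, 64, 64).items():
    print(f"{stage:15s} {shape}")
```

Stacking frames along the channel axis is what allows a purely convolutional translator to model temporal dependencies without recurrence; the encoder and decoder never see more than one frame at a time.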