SIAM: A Simple Alternating Mixer for Video Prediction

Xin Zheng, Ziang Peng, Yuan Cao, Hongming Shan, Junping Zhang

TL;DR

SIAM addresses the challenge of generic video prediction by unifying spatial, temporal, and spatiotemporal feature modeling within a latent-space encoder–decoder framework. Its DaMi block, comprising Spatial, Spatiotemporal, and Temporal Mixers, alternates processing across dimensions to progressively refine past-frame representations into future frames. The approach achieves state-of-the-art performance across four diverse datasets (Moving MNIST, TaxiBJ, WeatherBench, Human3.6M) while maintaining efficiency, demonstrating robustness across synthetic and real-world scenarios. This modular, simple design offers a scalable path for improved video forecasting and motivates future work on more attentive Mixer variants.

Abstract

Video prediction, the task of predicting future frames from previous ones, has broad applications such as autonomous driving and weather forecasting. Existing state-of-the-art methods typically focus on extracting only one of spatial, temporal, or spatiotemporal features from videos. These differing emphases, which stem from different network architectures, can make a model excel at some video prediction tasks but perform poorly on others. Towards a more generic video prediction solution, we explicitly model all three kinds of features in a unified encoder-decoder framework and propose a novel simple alternating Mixer (SIAM). The novelty of SIAM lies in the design of dimension alternating mixing (DaMi) blocks, which model spatial, temporal, and spatiotemporal features by alternating the dimensions of the feature maps. Extensive experimental results demonstrate the superior performance of the proposed SIAM on four benchmark video datasets covering both synthetic and real-world scenarios.
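
To make "alternating the dimensions of the feature maps" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a latent tensor of shape (B, T, C, H, W), uses plain linear token mixing as a stand-in for whatever each Mixer actually computes, adds residual connections, and applies the mixers in the order spatial, spatiotemporal, temporal. The names `mix_along` and `dami_block_sketch` and all weight shapes are hypothetical.

```python
import numpy as np


def mix_along(x, weight, axis):
    """Apply a linear map along one axis of x (stand-in for a real mixer)."""
    x = np.moveaxis(x, axis, -1)      # bring the axis to be mixed to the end
    y = x @ weight                    # (..., n) @ (n, n) -> (..., n)
    return np.moveaxis(y, -1, axis)   # restore the original axis order


def dami_block_sketch(z, w_spatial, w_spatiotemporal, w_temporal):
    """Illustrative dimension-alternating block.

    z has shape (B, T, C, H, W). Each stage reshapes z so that a different
    set of positions acts as the "token" axis, mixes along it, and adds the
    result back (the residual connections are an assumption).
    """
    B, T, C, H, W = z.shape

    # Spatial: tokens are the H*W positions inside each frame.
    s = z.reshape(B, T, C, H * W)
    s = s + mix_along(s, w_spatial, axis=3)
    z = s.reshape(B, T, C, H, W)

    # Spatiotemporal: tokens are all T*H*W positions of the clip.
    st = np.moveaxis(z, 2, 1).reshape(B, C, T * H * W)   # (B, C, T*H*W)
    st = st + mix_along(st, w_spatiotemporal, axis=2)
    z = np.moveaxis(st.reshape(B, C, T, H, W), 1, 2)     # back to (B, T, C, H, W)

    # Temporal: tokens are the T frames at each spatial location.
    z = z + mix_along(z, w_temporal, axis=1)
    return z


# Usage with random weights (all sizes are arbitrary illustration values).
rng = np.random.default_rng(0)
B, T, C, H, W = 2, 4, 3, 5, 5
z = rng.standard_normal((B, T, C, H, W))
out = dami_block_sketch(
    z,
    w_spatial=rng.standard_normal((H * W, H * W)) * 0.01,
    w_spatiotemporal=rng.standard_normal((T * H * W, T * H * W)) * 0.01,
    w_temporal=rng.standard_normal((T, T)) * 0.01,
)
```

The point of the sketch is only the reshaping pattern: the same tensor is viewed three ways so that one mixing operation can act over space, over space and time jointly, or over time alone. The output keeps the input shape, so blocks of this form could be stacked.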

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Comparison of different architectures for handling video data. (a) A sequence of frames is passed through RNNs, and one prediction is made at each timestep. (b) 2D-CNNs treat videos the same way as images by stacking all input frames along the channel dimension. (c) ViViT- and 3D-CNN-based models operate on patches in space and time simultaneously: ViViTs leverage the attention mechanism to process all patches globally, while 3D CNNs have relatively local receptive fields.
  • Figure 2: The overall architecture of the proposed SIAM and the detailed structure of the DaMi block.
  • Figure 3: Predicted results on the M-MNIST dataset.
  • Figure 4: Predicted results on the TaxiBJ dataset. MAE denotes the absolute error between the predicted results and the corresponding targets.
  • Figure B1: Predicted results on the M-MNIST dataset.
  • ...and 3 more figures