MS-RNN: A Flexible Multi-Scale Framework for Spatiotemporal Predictive Learning

Zhifeng Ma; Hao Zhang; Jie Liu

TL;DR

MS-RNN is a flexible, memory-efficient multi-scale framework that unifies a range of RNN-based video prediction models. By adding a mirror pyramid embedding with downsampling, upsampling, and skip connections, it enlarges the spatiotemporal receptive field while reducing both the memory taken by layer outputs and the FLOPs. Theoretical analyses and extensive experiments on Moving MNIST, TaxiBJ, KTH, and Germany show memory savings of up to approximately $56\%$ and clear performance gains over the baselines, supporting coarse-to-fine video synthesis. The approach targets high-resolution, real-time spatiotemporal forecasting in resource-constrained settings and offers practical design guidance for future video prediction systems.
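As a rough check on where a saving of this size can come from, the sketch below sums per-layer feature-map areas for a single-scale stack and for a mirror pyramid. The six-layer depth and the scale layout (1, 1/2, 1/4, 1/4, 1/2, 1) are assumptions made for illustration and are not stated in the summary above. The count also covers only layer-output area and ignores weights, channel widths, and any other buffers.

```python
from fractions import Fraction


def relative_output_memory(scales):
    """Sum of per-layer feature-map areas, in units of one full-resolution layer.

    Each entry of `scales` is the side-length ratio of that layer's output
    to the input frame, so its area ratio is the square of that value.
    """
    return sum(Fraction(s) ** 2 for s in scales)


# Assumed baseline: six stacked layers, all at full resolution.
single_scale = [Fraction(1)] * 6

# Assumed mirror pyramid: downsample by 2 twice, then upsample by 2 twice.
mirror_pyramid = [
    Fraction(1), Fraction(1, 2), Fraction(1, 4),
    Fraction(1, 4), Fraction(1, 2), Fraction(1),
]

baseline = relative_output_memory(single_scale)   # 6
pyramid = relative_output_memory(mirror_pyramid)  # 21/8
saving = 1 - pyramid / baseline                   # 9/16

print(f"output-memory saving: {float(saving):.2%}")  # prints "output-memory saving: 56.25%"
```

Under these assumptions the saving is 56.25%, in line with the approximately $56\%$ quoted above. Whether the paper arrives at its number in exactly this way is not stated here.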

Abstract

Spatiotemporal predictive learning, which uses deep learning to predict future frames from historical observations, is widely used in many fields. Previous work mostly improves model performance by widening or deepening the network, but this also causes memory overhead to surge, which seriously hinders the development and application of the technology. To improve performance without increasing memory consumption, we focus on scale, another dimension along which model performance can be improved with low memory requirements. The effectiveness of multi-scale design has been widely demonstrated in many CNN-based tasks such as image classification and semantic segmentation, but it has not been fully explored in recent RNN models. In this paper, drawing on the benefits of multi-scale design, we propose a general framework named Multi-Scale RNN (MS-RNN) to boost recent RNN models for spatiotemporal predictive learning. We verify the MS-RNN framework through thorough theoretical analyses and exhaustive experiments. The theory focuses on memory reduction and performance improvement, while the experiments employ eight RNN models (ConvLSTM, TrajGRU, PredRNN, PredRNN++, MIM, MotionRNN, PredRNN-V2, and PrecipLSTM) and four datasets (Moving MNIST, TaxiBJ, KTH, and Germany). The results show that RNN models incorporating our framework have much lower memory cost and better performance than their original versions. Our code is released at https://github.com/mazhf/MS-RNN.

Paper Structure (36 sections, 14 equations, 9 figures, 11 tables)

Introduction
Related Work
Preliminaries
Problem Formulation
ConvLSTM
MS-RNN
Framework
Structure Unification
Multi-scale Embedding
Analysis of Memory Reduction
Analysis of FLOPs Reduction
Analysis of Performance Improvement
Experiments
Implementation Details
Moving MNIST
...and 21 more sections

Figures (9)

Figure 1: Comparison of memory usage and performance of RNN and MS-RNN. For a fixed image size, the memory footprint grows as models become more advanced (e.g., ConvLSTM $\rightarrow$ PredRNN++ $\rightarrow$ MotionRNN). In contrast, our proposed multi-scale framework greatly reduces their memory footprint and brings additional performance improvement. Meanwhile, for a fixed memory footprint, our framework lets the basic models handle larger images, which widens the range of settings in which they can be used.
Figure 2: First, we integrate ConvLSTM [shi2015convolutional], TrajGRU [shi2017deep], PredRNN [wang2017predrnn], PredRNN++ [wang2018predrnn++], MIM [wang2019memory], MotionRNN [wu2021motionrnn], PredRNN-V2 [wang2022predrnn], and PrecipLSTM [ma2022preciplstm] into a unified RNN framework with the same layers (a), then we perform multi-scale processing to get the MS-RNN framework (b). Downsampling and upsampling operations are performed by max pooling (scale factor = 2) and bilinear interpolation (scale factor = 2) layers, respectively.
Figure 3: Qualitative comparison on the Moving MNIST dataset. The first row shows the ground-truth frames, with historical frames on the left and future frames on the right. The remaining rows show predicted frames. The same layout applies to the qualitative comparisons on the other datasets.
Figure 4: The receptive field and gradient of the ConvLSTM encoder and MS-ConvLSTM encoder on the Moving MNIST dataset. In each subfigure, the panels show, from left to right, the first, second, and third layers of the encoder.
Figure 5: The layer outputs of ConvLSTM and MS-ConvLSTM on the Moving MNIST dataset.
...and 4 more figures
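The caption of Figure 2 names the two resampling operations: max pooling and bilinear interpolation, each with scale factor 2. A minimal pure-Python sketch of both on a single-channel frame is given below. The function names are hypothetical, and the half-pixel-centre sampling with edge clamping is an assumed convention, since the caption does not specify one. This is an illustration of the operations, not the authors' implementation.

```python
def max_pool_2x(img):
    """2x2 max pooling with stride 2 on a list-of-lists image (even height and width)."""
    h, w = len(img), len(img[0])
    return [
        [max(img[i][j], img[i][j + 1], img[i + 1][j], img[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def bilinear_upsample_2x(img):
    """Bilinear interpolation to twice the height and width.

    Assumed convention: output pixel centres are mapped back to input
    coordinates with half-pixel offsets, and coordinates are clamped at the edges.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(2 * h):
        sy = min(max((y + 0.5) / 2 - 0.5, 0.0), h - 1)  # source row coordinate
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for x in range(2 * w):
            sx = min(max((x + 0.5) / 2 - 0.5, 0.0), w - 1)  # source column coordinate
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


frame = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
coarse = max_pool_2x(frame)              # [[6, 8], [14, 16]]
restored = bilinear_upsample_2x(coarse)  # back to 4x4
```

Because the upsampled map has the same height and width as the original frame, features from a layer before downsampling and a layer after upsampling line up spatially, which is what a skip connection between mirrored layers requires.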
