MS-LSTM: Exploring Spatiotemporal Multiscale Representations in Video Prediction Domain
Zhifeng Ma, Hao Zhang, Jie Liu
TL;DR
MS-LSTM tackles the inefficiency of deepening or widening video prediction models by introducing a spatiotemporal multiscale architecture. It combines SMS-LSTM (depth/downsampling with a mirrored pyramid decoder) and MK-LSTM (multi-kernel temporal memory) to expand the spatiotemporal receptive field while controlling training cost. The approach is supported by a theoretical analysis of cost and extensive experiments on Moving MNIST, TaxiBJ, KTH, and German precipitation nowcasting, showing improved performance with competitive or reduced resource usage compared to baseline ConvRNNs and competing multiscale models. The results indicate that explicit multiscale design offers robust long-term predictions and practical efficiency for high-resolution video prediction tasks.
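To make the "multi-kernel" idea concrete, here is a minimal sketch of how kernel size alone changes how quickly a convolutional recurrent state accumulates spatial context over time. It uses only generic convolution arithmetic and is not taken from the paper's analysis. The function name and the kernel sizes 3 and 5 are assumptions chosen for illustration.

```python
def recurrent_receptive_field(kernel_size, steps):
    """Per-axis extent (in pixels) that can influence one hidden-state location
    after `steps` recurrent updates, assuming each update applies a stride-1
    convolution of width `kernel_size` to the previous hidden state."""
    return 1 + steps * (kernel_size - 1)


# Hypothetical kernel sizes; the summary does not state which ones MK-LSTM uses.
for k in (3, 5):
    growth = [recurrent_receptive_field(k, t) for t in (1, 5, 10)]
    print(f"kernel {k}: receptive field after 1, 5, 10 steps = {growth}")
```

Under these assumptions, a larger kernel covers fast or large-displacement motion in fewer time steps, while a smaller kernel stays focused on slow, local changes. Combining both is one way to read "temporal multiscale".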
Abstract
Motion in video varies drastically across both space and time, which makes video prediction extremely challenging. Existing RNN models improve performance by deepening or widening the network. They obtain multi-scale video features only by stacking layers, which is inefficient and incurs prohibitive training costs (memory, FLOPs, and training time). In contrast, this paper proposes MS-LSTM, a spatiotemporal multi-scale model designed entirely from a multi-scale perspective. On top of stacked layers, MS-LSTM incorporates two additional efficient multi-scale designs to fully capture spatiotemporal context. Concretely, we employ LSTMs arranged in a mirrored pyramid structure to construct spatial multi-scale representations, and LSTMs with different convolution kernels to construct temporal multi-scale representations. We theoretically analyze the training cost and performance of MS-LSTM and its components. Detailed comparison experiments with twelve baseline models on four video datasets show that MS-LSTM achieves better performance at lower training cost.
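The efficiency argument for a pyramid over plain stacking can be illustrated with standard receptive-field arithmetic. The sketch below is not the paper's theoretical analysis. It assumes 3x3 stride-1 convolutions, 2x2 stride-2 downsampling, a constant channel count, and a cost of zero for downsampling, and it covers only the encoder half of a mirrored pyramid. All names are hypothetical.

```python
CONV = (3, 1)  # assumed 3x3 convolution, stride 1
DOWN = (2, 2)  # assumed 2x2 downsampling, stride 2


def receptive_field(layers):
    """Per-axis receptive field of a chain of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k-1) input jumps
        jump *= stride             # striding spaces later taps further apart
    return rf


def relative_conv_cost(layers):
    """Convolution cost in units of one stride-1 conv at full resolution.

    Assumes constant channels, so cost scales with feature-map area only;
    downsampling layers (stride > 1) are treated as free.
    """
    cost, area = 0.0, 1.0
    for _, stride in layers:
        if stride == 1:
            cost += area
        area /= stride * stride
    return cost


flat = [CONV, CONV, CONV]                # three stacked layers, full resolution
pyramid = [CONV, DOWN, CONV, DOWN, CONV]  # same three convs, downsampled in between

for name, layers in (("flat", flat), ("pyramid", pyramid)):
    print(name, receptive_field(layers), relative_conv_cost(layers))
```

With these assumptions, the same three convolutions see a wider spatial context in the pyramid layout while doing less than half the convolution work, which matches the intuition behind the abstract's cost claim.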
