USTEP: Spatio-Temporal Predictive Learning under A Unified View
Cheng Tan, Jue Wang, Zhangyang Gao, Siyuan Li, Stan Z. Li
TL;DR
USTEP addresses the fragmentation between recurrent-based and recurrent-free spatio-temporal predictive learning by unifying micro-temporal and macro-temporal scales into a dual-module framework with cross-segment gating. The approach defines temporal scale sets, performs single-segment temporal modeling in a shared feature space, and fuses information across scales to capture both short-term dynamics and long-range context with high efficiency. Empirical results across Moving MNIST, KTH, WeatherBench, Caltech Pedestrian, SEVIR, and UCF Sports show state-of-the-art or competitive accuracy with substantially reduced parameters and FLOPs compared to fully recurrent models, while preserving flexibility across tasks with equal, extended, and reduced frame predictions. The work offers practical guidelines for choosing temporal strides and demonstrates strong generalization and deployment practicality on diverse hardware, marking a significant step toward scalable, unified spatio-temporal forecasting.
Abstract
Spatio-temporal predictive learning plays a crucial role in self-supervised learning, with wide-ranging applications across a diverse range of fields. Previous approaches for temporal modeling fall into two categories: recurrent-based and recurrent-free methods. The former, while meticulously processing frames one by one, neglect short-term spatio-temporal information redundancies, leading to inefficiencies. The latter naively stack frames sequentially, overlooking the inherent temporal dependencies. In this paper, we re-examine the two dominant temporal modeling approaches within the realm of spatio-temporal predictive learning, offering a unified perspective. Building upon this analysis, we introduce USTEP (Unified Spatio-TEmporal Predictive learning), an innovative framework that reconciles the recurrent-based and recurrent-free methods by integrating both micro-temporal and macro-temporal scales. Extensive experiments on a wide range of spatio-temporal predictive learning demonstrate that USTEP achieves significant improvements over existing temporal modeling approaches, thereby establishing it as a robust solution for a wide range of spatio-temporal applications.
