USTEP: Spatio-Temporal Predictive Learning under A Unified View

Cheng Tan, Jue Wang, Zhangyang Gao, Siyuan Li, Stan Z. Li

TL;DR

USTEP bridges recurrent-based and recurrent-free spatio-temporal predictive learning by combining micro-temporal and macro-temporal scales in a dual-module framework with cross-segment gating. The approach defines sets of temporal scales, models each temporal segment in a shared feature space, and fuses information across scales to capture both short-term dynamics and long-range context efficiently. Experiments on Moving MNIST, KTH, WeatherBench, Caltech Pedestrian, SEVIR, and UCF Sports show state-of-the-art or competitive accuracy with substantially fewer parameters and FLOPs than fully recurrent models, across tasks that predict equal, extended, and reduced numbers of frames. The work also offers practical guidelines for choosing temporal strides and reports strong generalization and practical deployment on diverse hardware.

Abstract

Spatio-temporal predictive learning plays a crucial role in self-supervised learning, with applications across diverse fields. Previous approaches for temporal modeling fall into two categories: recurrent-based and recurrent-free methods. The former, while meticulously processing frames one by one, neglect short-term spatio-temporal information redundancies, leading to inefficiencies. The latter naively stack frames sequentially, overlooking the inherent temporal dependencies. In this paper, we re-examine the two dominant temporal modeling approaches within the realm of spatio-temporal predictive learning, offering a unified perspective. Building upon this analysis, we introduce USTEP (Unified Spatio-TEmporal Predictive learning), an innovative framework that reconciles the recurrent-based and recurrent-free methods by integrating both micro-temporal and macro-temporal scales. Extensive experiments on a wide range of spatio-temporal predictive learning tasks demonstrate that USTEP achieves significant improvements over existing temporal modeling approaches, thereby establishing it as a robust solution for a wide range of spatio-temporal applications.

Paper Structure

This paper contains 22 sections, 6 equations, 9 figures, 9 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Frame-by-frame MSE/MAE comparison between the representative recurrent-based method PredRNN and the recurrent-free method SimVP on the extended frame task using the KTH dataset. The plot illustrates the differences in performance across individual frames.
  • Figure 2: Temporal modeling comparison between recurrent-based, recurrent-free and our unified temporal modeling.
  • Figure 3: Illustration of micro- and macro-temporal scales, using a $4 \rightarrow 4$ frame prediction as an example. Each green circle represents an individual frame. Micro-temporal scales (in red) divide the sequence into non-overlapping temporal segments, each containing a few consecutive frames. The macro-temporal scale (in blue) further divides the sequence into larger temporal segments. The size of $\mathcal{V}$ is $|\mathcal{U}|-1$.
  • Figure 4: Illustration of USTEP’s unified spatio-temporal predictive learning framework. (a) Temporal segment partition: USTEP constructs macro-temporal segments using both the previous and the current micro-temporal segments. This approach allows the model to eliminate short-term redundancies while preserving long-term context. (b) Detailed architecture of USTEP: The framework consists of two specialized recurrent-free modules, $F^U_{\theta_1}$ and $F^V_{\theta_2}$, which handle channel mixing for micro- and macro-temporal scales, respectively. Hidden states from both scales are integrated through a gating mechanism and cross-segment-level temporal modeling, ensuring comprehensive spatio-temporal predictive learning.
  • Figure 5: Qualitative visualization on Moving MNIST.
  • ...and 4 more figures
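
The segment partition described in the captions of Figures 3 and 4(a) can be sketched in a few lines of Python. This is only an illustrative reading of those captions, not the authors' implementation: the function names `micro_segments` and `macro_segments`, the segment length `delta_t = 2`, and the use of integer frame indices in place of image tensors are all assumptions made for the example. Micro-temporal segments are non-overlapping runs of consecutive frames, and each macro-temporal segment joins the previous micro-temporal segment with the current one, which is why the macro set has one fewer element than the micro set.

```python
from typing import List, Sequence


def micro_segments(frames: Sequence[int], delta_t: int) -> List[List[int]]:
    """Split frames into non-overlapping segments of delta_t consecutive frames.

    Hypothetical helper: delta_t is an assumed segment length.
    """
    if delta_t <= 0 or len(frames) % delta_t != 0:
        raise ValueError("delta_t must be positive and divide the sequence length")
    return [list(frames[i:i + delta_t]) for i in range(0, len(frames), delta_t)]


def macro_segments(micro: List[List[int]]) -> List[List[int]]:
    """Join each micro segment with its predecessor.

    Hypothetical helper: yields len(micro) - 1 macro segments.
    """
    return [prev + cur for prev, cur in zip(micro, micro[1:])]


# Integer indices stand in for frames (an assumption for illustration only).
frames = list(range(8))
U = micro_segments(frames, delta_t=2)
V = macro_segments(U)

print(U)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(V)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]
```

With these assumed settings, `len(V) == len(U) - 1`, matching the relation stated in the Figure 3 caption. Consecutive macro segments overlap by one micro segment, which is one plausible way to read "using both the previous and the current micro-temporal segments".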