UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines

Chen Tang, Xinzhu Ma, Encheng Su, Xiufeng Song, Xiaohong Liu, Wei-Hong Li, Lei Bai, Wanli Ouyang, Xiangyu Yue

TL;DR

This work introduces UniSTD, a unified Transformer-based framework for spatiotemporal learning across multiple disciplines. It adopts a two-stage optimization: broad, task-agnostic pretraining on 2D vision and vision-text data, followed by specialized joint training on diverse spatiotemporal tasks to enable cross-task generalization. A key contribution is the rank-adaptive mixture-of-experts (AdaMoE) with a continuous relaxation of ranks, paired with a lightweight temporal module, which together decouple spatial and temporal modeling and support scalable multi-task learning. Evaluations on a large-scale benchmark spanning four disciplines and ten tasks demonstrate strong performance gains and efficient cross-task adaptation, indicating the potential for general-purpose, multi-domain spatiotemporal learning with reduced per-task design costs.

Abstract

Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, inspired by recent foundation models and their two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, which is followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve learning across domains, our framework employs a rank-adaptive mixture-of-experts adaptation that uses fractional interpolation to relax the discrete rank variables so that they can be optimized in continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at https://github.com/1hunters/UniSTD.
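To make the idea of relaxing a discrete rank through fractional interpolation more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `fractional_rank_mask` and `low_rank_update`, the low-rank factor layout, and the specific interpolation rule (full weight on the first ⌊r⌋ rank components, the fractional part of r on the next one) are all assumptions made for illustration. Under these assumptions, a non-integer rank yields an update that lies linearly between the updates of the two neighboring integer ranks, so the rank can be treated as a continuous variable.

```python
import math


def fractional_rank_mask(rank, max_rank):
    """Soft mask over rank components for a continuous rank value.

    Hypothetical rule: components below floor(rank) get weight 1, the next
    component gets the fractional part, the rest get 0. The mask sums to rank.
    """
    if not 0.0 <= rank <= max_rank:
        raise ValueError("rank must lie in [0, max_rank]")
    lower = math.floor(rank)
    frac = rank - lower
    mask = [1.0] * lower + [0.0] * (max_rank - lower)
    if lower < max_rank:
        mask[lower] = frac
    return mask


def low_rank_update(down, up, rank):
    """Weight update of one expert: sum_k mask[k] * outer(up[:, k], down[k, :]).

    down: max_rank x d_in matrix (list of rows), assumed layout.
    up:   d_out x max_rank matrix (list of rows), assumed layout.
    """
    max_rank = len(down)
    mask = fractional_rank_mask(rank, max_rank)
    d_out, d_in = len(up), len(down[0])
    return [
        [sum(mask[k] * up[i][k] * down[k][j] for k in range(max_rank))
         for j in range(d_in)]
        for i in range(d_out)
    ]


if __name__ == "__main__":
    down = [[1.0, 0.0], [0.0, 2.0]]
    up = [[1.0, 1.0], [0.0, 1.0]]
    # A rank of 1.5 lies halfway between the rank-1 and rank-2 updates.
    for r in (1, 1.5, 2):
        print(r, low_rank_update(down, up, r))
```

In a real training setup the continuous rank would be a learnable parameter updated by gradient descent in an automatic-differentiation framework; the sketch only shows the forward computation.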

Paper Structure

This paper contains 16 sections, 11 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Top: Existing works need a specialized model for each task, even for tasks within the same discipline (e.g., weather forecasting, traffic control). Bottom: Unified Spatio-Temporal Learning. UniSTD unifies 4 disciplines with 10 tasks under one model and is trained on a massive collection of datasets.
  • Figure 2: Illustration of UniSTD. Our method supports unified and scalable spatiotemporal learning across diverse disciplines. To achieve this, we use a standard Transformer as the backbone, allowing us to take advantage of the pretrained weights from large-scale task-agnostic pretraining. Furthermore, to better embed domain-specific knowledge into the model, we design a rank-adaptive MoE mechanism that dynamically adjusts the sub-architectures of the model during joint training, and a lightweight temporal attention module that explicitly captures the temporal dynamics.
  • Figure 3: Weight updating patterns of task-specialized experts during optimization. We show the L1-norm of queries (Q), keys (K), and values (V) as a measure of the optimal rank of each expert.
  • Figure 4: Visualization of the average rank across selected layers (Q, K, V and Proj) in MoEs.
  • Figure 5: Visualization of the prediction results using a shared model.