Diffusion Transformers as Open-World Spatiotemporal Foundation Models

Yuan Yuan, Chonghua Han, Jingtao Ding, Guozhen Zhang, Depeng Jin, Yong Li

TL;DR

UrbanDiT introduces an open-world foundation model for urban spatio-temporal learning by marrying diffusion transformers with a unified prompt-learning framework. It unifies grid- and graph-based data into a sequential input and supports multiple tasks via data-driven and task-specific prompts, enabling strong zero-shot generalization across cities. Empirical results show state-of-the-art performance on diverse forward, backward, interpolation, extrapolation, and imputation tasks, with robust few-shot and zero-shot capabilities and scalable behavior as data size increases. This work advances practical open-world urban modeling by providing a single, extensible model and releasing data/code to support broad adoption and further research.
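The idea of unifying grid- and graph-based data into a sequential input can be illustrated with a small sketch. This is not the authors' tokenizer: the function names `grid_to_sequence` and `graph_to_sequence`, the patch size, and the choice of one token per node per time step are all assumptions made for illustration. The sketch only shows how two differently shaped inputs can end up as the same kind of object, a 2-D array of tokens.

```python
import numpy as np


def grid_to_sequence(x: np.ndarray, patch: int) -> np.ndarray:
    """Flatten a grid series of shape (T, H, W) into tokens.

    Each token is one non-overlapping patch of size patch x patch
    (hypothetical choice), giving shape (T * H/patch * W/patch, patch * patch).
    """
    T, H, W = x.shape
    if H % patch or W % patch:
        raise ValueError("H and W must be divisible by the patch size")
    # Split both spatial axes into (number of patches, patch size).
    x = x.reshape(T, H // patch, patch, W // patch, patch)
    # Bring the two patch-index axes together, then the two within-patch axes.
    x = x.transpose(0, 1, 3, 2, 4)
    return x.reshape(T * (H // patch) * (W // patch), patch * patch)


def graph_to_sequence(x: np.ndarray) -> np.ndarray:
    """Flatten graph signals of shape (T, N) into tokens.

    Each token is one node at one time step (hypothetical choice),
    giving shape (T * N, 1).
    """
    T, N = x.shape
    return x.reshape(T * N, 1)


grid = np.arange(2 * 4 * 4).reshape(2, 4, 4)   # 2 time steps on a 4x4 grid
graph = np.ones((3, 5))                        # 3 time steps on 5 nodes

grid_tokens = grid_to_sequence(grid, patch=2)
graph_tokens = graph_to_sequence(graph)
```

After this step both inputs are sequences of tokens, so a single transformer backbone could in principle consume either one once the tokens are projected to a shared embedding width.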

Abstract

The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in an elaborate prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications. UrbanDiT offers three advantages: 1) It unifies diverse data types, such as grid-based and graph-based data, into a sequential format; 2) With task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) It generalizes effectively to open-world scenarios, with its zero-shot performance surpassing nearly all baselines that have access to training data. UrbanDiT establishes a new benchmark for foundation models in the urban spatio-temporal domain. Code and datasets are publicly available at https://github.com/tsinghua-fib-lab/UrbanDiT.
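The tasks listed in the abstract differ mainly in which parts of a spatio-temporal block are observed and which must be generated. A minimal sketch of that idea follows, assuming a boolean mask over a (time, location) block where True means observed. The function name `task_mask`, the task strings, the half/half split, and the `ratio` parameter are all hypothetical; the actual masking strategies are the ones defined in the paper and its repository.

```python
import numpy as np


def task_mask(task: str, T: int, N: int, ratio: float = 0.5, rng=None) -> np.ndarray:
    """Return a (T, N) boolean mask; True = observed, False = to be generated."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = np.ones((T, N), dtype=bool)
    if task == "forward":
        # Observe the first half of the window, generate the second half.
        mask[T // 2:, :] = False
    elif task == "backward":
        # Observe the second half of the window, generate the first half.
        mask[:T // 2, :] = False
    elif task == "temporal_interpolation":
        # Hide every other time step at all locations.
        mask[1::2, :] = False
    elif task == "spatial_extrapolation":
        # Hide a random subset of locations over the whole window.
        hidden = rng.choice(N, size=int(N * ratio), replace=False)
        mask[:, hidden] = False
    elif task == "imputation":
        # Hide individual (time, location) entries at random.
        mask = rng.random((T, N)) >= ratio
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


forward = task_mask("forward", T=6, N=4)
backward = task_mask("backward", T=6, N=4)
interp = task_mask("temporal_interpolation", T=6, N=4)
extrap = task_mask("spatial_extrapolation", T=6, N=4)
imputed = task_mask("imputation", T=6, N=4)
```

Under this framing, one denoising model can serve every task: the observed entries act as conditioning and the model generates only the masked entries, so switching tasks means switching masks rather than switching models.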

Paper Structure

This paper contains 26 sections, 7 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: A diagram of UrbanDiT: a foundation model that integrates diverse data sources while addressing multiple tasks.
  • Figure 2: Illustration of the whole framework of UrbanDiT, including four key components: a) Unifying different urban spatio-temporal data types; b) The diffusion pipeline of our UrbanDiT; c) Different masking strategies to specify different tasks; d) Unified prompt learning with data-driven and task-specific prompts to enhance the denoising process.
  • Figure 3: Structure of memory pools.
  • Figure 4: Evaluation of UrbanDiT and baseline models in 5% and 1% few-shot scenarios on the PopSH dataset. The red dashed line indicates UrbanDiT's zero-shot performance.
  • Figure 5: Ablation study on the prompt design using RMSE on the TaxiBJ dataset.
  • ...and 3 more figures