UrbanFM: Scaling Urban Spatio-Temporal Foundation Models

Wei Chen, Yuqian Wu, Junle Chen, Xiaofang Zhou, Yuxuan Liang

TL;DR

This work introduces the MiniST unit, a novel split mechanism that discretizes continuous spatio-temporal fields into learnable computational units to unify representations of grid-based and sensor-based observations, and proposes UrbanFM, a minimalist self-attention architecture designed to autonomously learn dynamic spatio-temporal dependencies from massive data.

Abstract

Urban systems, as dynamic complex systems, continuously generate spatio-temporal data streams that encode the fundamental laws of human mobility and city evolution. While AI for Science has witnessed the transformative power of foundation models in disciplines like genomics and meteorology, urban computing remains fragmented due to "scenario-specific" models, which are overfitted to specific regions or tasks, hindering their generalizability. To bridge this gap and advance spatio-temporal foundation models for urban systems, we adopt scaling as the central perspective and systematically investigate two key questions: what to scale and how to scale. Grounded in first-principles analysis, we identify three critical dimensions: heterogeneity, correlation, and dynamics, aligning these principles with the fundamental scientific properties of urban spatio-temporal data. Specifically, to address heterogeneity through data scaling, we construct WorldST. This billion-scale corpus standardizes diverse physical signals, such as traffic flow and speed, from over 100 global cities into a unified data format. To enable computation scaling for modeling correlations, we introduce the MiniST unit, a novel split mechanism that discretizes continuous spatio-temporal fields into learnable computational units to unify representations of grid-based and sensor-based observations. Finally, addressing dynamics via architecture scaling, we propose UrbanFM, a minimalist self-attention architecture designed with limited inductive biases to autonomously learn dynamic spatio-temporal dependencies from massive data. Furthermore, we establish EvalST, the largest-scale urban spatio-temporal benchmark to date. Extensive experiments demonstrate that UrbanFM achieves remarkable zero-shot generalization across unseen cities and tasks, marking a pivotal first step toward large-scale urban spatio-temporal foundation models.
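The abstract describes the MiniST unit only at a high level, so the following is a minimal sketch of one way a spatio-temporal field could be split into fixed-size units, not the authors' implementation. The function name `split_into_units`, the unit sizes `t_len` and `n_len`, the zero-padding, and the choice to flatten a grid to a node axis are all assumptions made for illustration. It shows how grid-based data (time, height, width) and sensor-based data (time, sensors) can end up as unit sequences with the same layout.

```python
# Illustrative sketch only: the names, unit sizes, and zero-padding are
# assumptions, not details taken from the paper.

def split_into_units(x, t_len, n_len):
    """Split a T x N field (list of rows) into non-overlapping units.

    Each unit covers t_len time steps and n_len locations and is returned
    flattened in time-major order. The field is zero-padded so that both
    axes are multiples of the unit size.
    """
    T, N = len(x), len(x[0])
    Tp = T + (-T) % t_len  # padded number of time steps
    Np = N + (-N) % n_len  # padded number of locations
    padded = [
        [x[t][n] if t < T and n < N else 0.0 for n in range(Np)]
        for t in range(Tp)
    ]
    units = []
    for t0 in range(0, Tp, t_len):
        for n0 in range(0, Np, n_len):
            units.append([
                padded[t][n]
                for t in range(t0, t0 + t_len)
                for n in range(n0, n0 + n_len)
            ])
    return units


# Grid-based observations: 12 time steps on a 4 x 4 grid, flattened to 16 cells.
grid = [[float(t * 16 + c) for c in range(16)] for t in range(12)]
grid_units = split_into_units(grid, t_len=6, n_len=8)

# Sensor-based observations: 12 time steps from 10 sensors (padded to 16).
sensors = [[1.0] * 10 for _ in range(12)]
sensor_units = split_into_units(sensors, t_len=6, n_len=8)

print(len(grid_units), len(grid_units[0]))      # units and values per unit
print(len(sensor_units), len(sensor_units[0]))  # same layout as the grid case
```

Under these assumptions both inputs produce units of identical length, so a single sequence model could consume either source. How the actual MiniST unit orders locations, handles irregular sensor layouts, or learns its units is not specified in this section.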

Paper Structure (65 sections, 3 equations, 14 figures, 9 tables, 1 algorithm)

Figures (14)

  • Figure 1: Top: The fundamental nature of spatio-temporal data and the challenges of urban ST foundation models. Bottom: Three core perspectives for scaling the urban ST foundation model: data, computation, and architecture.
  • Figure 2: Data Overview: Breakthroughs in multi-domain coverage and spatio-temporal scale. Notably, we far surpass UniST [yuan2024unist], OpenCity [li2024opencity], and BigCity [yu2025bigcity] in spatial regions and temporal spans, by a factor of 33 to 145.
  • Figure 3: A schematic diagram of the scaling mechanism: MiniST tokenization and UrbanFM model.
  • Figure 4: Zero-shot forecasting effectiveness on various spatio-temporal benchmarks (full results in Tables \ref{tab:zero_few_grid}, \ref{tab:zero_few_graph}, \ref{tab:few_full_graph}, and \ref{tab:few_full_grid}).
  • Figure 5: Evaluation of UrbanFM's performance gain through few-shot tuning and comparison with full-shot expert models.
  • ...and 9 more figures