GenAI Models Capture Urban Science but Oversimplify Complexity

Yecheng Zhang, Rong Zhao, Zimu Huang, Xinyu Wang, Yue Ma, Ying Long

TL;DR

AI4US introduces a Generate-Evaluate-Calibrate workflow to test GenAI's utility as a virtual laboratory for urban science, assessing both symbolic (theory-driven) and perceptual (scene-based) data. The study finds that GenAI can reproduce core urban patterns, such as scaling and center-periphery decay, and can mimic qualitative vitality, yet the synthetic outputs tend to be homogeneous and biased in their parameters, producing what the authors call 'Mirage cities'. A post-hoc calibration based on optimal transport significantly improves distributional fidelity and, together with targeted prompting, brings outputs closer to empirical data, though GenAI is not yet a true world model. The work outlines a practical hybrid research path in which generative priors enable rapid theory exploration and hypothesis generation, while calibration and causal urban simulators provide the depth and validity needed for robust urban science.

Abstract

Generative artificial intelligence (GenAI) models are increasingly used for scientific data generation, yet their alignment with empirical knowledge in urban science remains unclear. Here, we introduce AI4US (Artificial Intelligence for Urban Science), a framework that systematically evaluates leading GenAI models by testing their fidelity in generating both symbolic and perceptual urban data. For the symbolic domain, we benchmark generated data against foundational urban theories concerning scale, space, and morphology. For the perceptual domain, we validate the models' visual judgments against human benchmarks and, critically, leverage their generative control to conduct causal experiments on urban perception. Our findings show that while GenAI models reproduce core theoretical patterns, the generated data exhibit crucial limitations: poor diversity, systematic parametric deviations, and only limited improvement from prompt engineering. To address this, we introduce a post-hoc calibration procedure using optimal transport, which produces synthetic symbolic datasets with demonstrably higher fidelity.
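The abstract does not spell out the calibration procedure, so the following is only a minimal sketch of what an optimal-transport calibration can look like in the simplest setting: a single variable, where the optimal transport map for a convex cost reduces to monotone quantile matching. The function names `ot_calibrate_1d` and `wasserstein_1d`, the synthetic samples, and the one-dimensional setting are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def ot_calibrate_1d(generated, reference):
    """Map generated values onto the reference distribution.

    In one dimension the optimal transport map (convex cost) is the
    monotone rearrangement: each generated value is sent to the
    reference quantile at the same rank-based probability level.
    """
    generated = np.asarray(generated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Rank of every generated value (0 = smallest).
    ranks = np.argsort(np.argsort(generated))
    # Mid-point probability level for each rank, strictly inside (0, 1).
    levels = (ranks + 0.5) / generated.size
    return np.quantile(reference, levels)


def wasserstein_1d(a, b, n_levels=200):
    """Approximate 1-Wasserstein distance by comparing quantile functions."""
    q = (np.arange(n_levels) + 0.5) / n_levels
    return float(np.mean(np.abs(np.quantile(a, q) - np.quantile(b, q))))


# Hypothetical data: a skewed "empirical" sample and a narrow, overly
# homogeneous "generated" sample (both synthetic, for illustration only).
rng = np.random.default_rng(0)
reference = rng.lognormal(mean=3.0, sigma=0.5, size=400)
generated = rng.normal(loc=20.0, scale=1.0, size=300)

calibrated = ot_calibrate_1d(generated, reference)
before = wasserstein_1d(generated, reference)
after = wasserstein_1d(calibrated, reference)
print(f"W1 before calibration: {before:.3f}, after: {after:.3f}")
```

Because the map is monotone, the ordering of the generated cities is preserved while their marginal distribution is pulled toward the empirical one. A multivariate calibration would need a full optimal transport solver rather than this rank-based shortcut.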

Paper Structure

This paper contains 16 sections, 13 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Urban scaling law reproduced on generated data with GenAI models: generated versus empirical coefficients. 'Replicate Tests' comprises 100 independent replicate data generation experiments for each GenAI model, each generating data for 100 cities. GPT-4o performs best among the models (see Supplementary Experiment 6). The diagram shows the results of GPT-4o.
  • Figure 2: Urban decay reproduced on generated data with GenAI models: generated versus empirical statistical and spatial patterns. a, Concentric-ring land-density profiles for three representative cities on different continents across three decades (1990-2010), together with the corresponding 'Mirage cities' produced by GPT-4o when both city name and time period are included in the prompt. b, Sensitivity of the generated decay curves to prompts that specify different maximum radii and historical periods. c, Direct side-by-side comparison of Beijing's empirical profile with that of its mirage city generated by GenAI models.
  • Figure 3: Urban vitality revealed on generated data with GenAI models. 'Replicate Tests' comprises 100 independent replicate data generation experiments for the GenAI models, each generating data for 100 blocks. The diagram shows the results of GPT-4o.
  • Figure 4: Potential of GenAI models in qualitative theory exploration in visual urban space. a, AI perception aligns with human judgment. A confusion matrix of pairwise choices (left) and per-category agreement rates with Cohen’s Kappa scores (right). b, Identifying influential visual elements. Standardized beta coefficients ($\beta$) from regression models reveal the key visual features that are positively or negatively correlated with each of the six perceptual scores. c, Causal inference via thematic interventions. The average change in perception scores ($\Delta$ Score) quantifies the causal impact of programmatically adding thematic element categories to scenes: Natural Elements (trees, grass, sky, water, bushes), Traffic Elements (cars, trucks, bridges, roads), and Built Elements (buildings, fences, walls). Error bars represent 95% confidence intervals.
  • Figure 5: Data distribution divergence analysis. a, Equal-width binning with relative frequency normalization was applied to convert raw sample counts into scale-invariant frequency distributions. Mean Absolute Error (MAE) of bin-level frequencies quantified the central tendency divergence between real data (R) and generated data (G), calculated as the sum of absolute differences across all 15 bins. Overlap Ratio (OR) measured quantile interval congruence between corresponding bins of R and G distributions. b, Jensen-Shannon Divergence (JSD) comparisons among four datasets: real data (R), baseline generated data (G), China-prompted generated data (GC), and USA-prompted generated data (GA). The inclusion of geographical constraints, such as 'China', increased GPT-4o's constraint-triggering rate by 24% (24/100 trials) (e.g., 'I can't provide real-time data such as the current population').
  • ...and 15 more figures
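
The divergence measures named in the Figure 5 caption can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names, the choice of bin edges shared across both samples, the base-2 logarithm, and the synthetic data are assumptions. The bin-level error is computed here as a mean, so the sum of absolute differences mentioned in the caption equals this value multiplied by the number of bins. The Overlap Ratio is omitted because the caption does not define it precisely enough to reproduce.

```python
import numpy as np


def binned_frequencies(real, generated, n_bins=15):
    """Equal-width binning on shared edges, normalized to relative frequencies."""
    real = np.asarray(real, dtype=float)
    generated = np.asarray(generated, dtype=float)
    lo = min(real.min(), generated.min())
    hi = max(real.max(), generated.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    r_counts, _ = np.histogram(real, bins=edges)
    g_counts, _ = np.histogram(generated, bins=edges)
    return r_counts / r_counts.sum(), g_counts / g_counts.sum()


def bin_mae(p, q):
    """Mean absolute difference between two bin-frequency vectors."""
    return float(np.mean(np.abs(p - q)))


def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence; with base 2 it lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return float((0.5 * kl(p, m) + 0.5 * kl(q, m)) / np.log(base))


# Hypothetical samples standing in for real (R) and generated (G) data.
rng = np.random.default_rng(1)
real = rng.lognormal(mean=3.0, sigma=0.6, size=500)
gen = rng.normal(loc=22.0, scale=3.0, size=500)

p_real, p_gen = binned_frequencies(real, gen)
print("bin-level MAE:", bin_mae(p_real, p_gen))
print("JSD (base 2):", js_divergence(p_real, p_gen))
```

Using the same bin edges for both samples keeps the two frequency vectors directly comparable; binning each sample on its own range would hide differences in spread.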