
Envisioning global urban development with satellite imagery and generative AI

Kailai Sun, Yuebing Liang, Mingyi He, Yunhan Zheng, Alok Prakash, Shenhao Wang, Jinhua Zhao, Alex "Sandy" Pentland

Abstract

Urban development has been a defining force in human history, shaping cities for centuries. However, past studies have mostly framed such development as a predictive task, failing to reflect its generative nature. This study therefore designs a multimodal generative AI framework to envision sustainable urban development at a global scale. By integrating text prompts and geospatial controls, our framework generates high-fidelity, diverse, and realistic urban satellite imagery across the 500 largest metropolitan areas worldwide. It enables users to specify urban development goals, creating new images that align with them while offering diverse scenarios whose appearance can be controlled with text prompts and geospatial constraints. It also facilitates urban redevelopment practice by learning from the surrounding environment. Beyond visual synthesis, we find that the framework encodes and interprets latent representations of urban form for global cross-city learning, successfully transferring styles of urban environments across a global spatial network. These latent representations also enhance downstream prediction tasks such as carbon emission prediction. Furthermore, human expert evaluation confirms that the generated urban images are comparable to real urban images. Overall, this study presents innovative approaches for accelerating urban planning and supports scenario-based planning processes for cities worldwide.
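The abstract describes conditioning image synthesis on both text prompts (development goals) and geospatial controls (spatial constraints). As an illustration only, the sketch below shows how such dual conditioning could be wired up with an off-the-shelf ControlNet-style diffusion pipeline from Hugging Face diffusers; the checkpoints, the land-use raster used as the control signal, and the prompt wording are assumptions made for this example, not the authors' actual implementation.

# Illustrative sketch: text + geospatial conditioning with a ControlNet-style pipeline.
# Checkpoints, file names, and prompt are assumptions, not the paper's framework.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

# Geospatial constraint: e.g., a rasterized land-use or road-network map (hypothetical file).
control_image = Image.open("landuse_map.png").convert("RGB")

# Segmentation-conditioned ControlNet used here as a stand-in for geospatial control.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Text prompt encoding an urban development goal and a target city style.
prompt = "satellite image of Singapore, high building volume density, abundant green space"

# Generate one conditioned satellite-style image.
image = pipe(prompt, image=control_image, num_inference_steps=30).images[0]
image.save("generated_satellite.png")

In the paper's setting, the control signal and the prompt would correspond to the geospatial constraints and counterfactual texts (e.g., density-metric or land-use prompts) illustrated in Figure 2.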

Paper Structure

This paper contains 18 sections, 9 equations, 14 figures, and 3 tables.

Figures (14)

  • Figure 1: Spatial heterogeneity of population density and building volume density for 500 global cities. a. City-level global population density. b. Distribution of population density across the 500 cities, sorted in ascending order. c. City-level global building volume density. d. Distribution of building volume density across the 500 cities, sorted in ascending order. In a and c, the city-level densities are divided into five qualitative levels, where warm colors (orange and red) and cool colors (green and light green) indicate high and low densities, respectively. In b and d, each dot corresponds to a city and is color-coded by continent (Europe, Asia, North America, South America, Oceania, and Africa). Representative cities, including Boston, Stockholm, Kigali, Singapore, and Hong Kong, are highlighted for reference.
  • Figure 2: Multimodal conditional counterfactual synthesis of urban satellite imagery. a. Generated images responding to counterfactual template texts. The first row shows three images generated from the density-metric prompt, and the second row shows three images generated from the land-use prompt. b. Generated images responding to free-form texts. c. Urban redevelopment with the image inpainting model. d. Fidelity, precision, and diversity performance of our generative framework.
  • Figure 3: Cross-city learning through generative urban visualization. a. Satellite images generated across cities from a source city (e.g., Hong Kong, Lusaka). b. Performance variation across regions. c. Human evaluation of our framework by urban experts: 42 experts from academia and industry (based in the United States, Singapore, China, and other countries) rated 136 questions on a 1-10 scale. The generated urban images are rated as comparable to real urban images.
  • Figure 4: Our framework quantifies urban visuals in latent space. a. Our framework learns a distinguishable visual representation for each city; results for other cities can be found in the SI. b. Cross-city transfer via prompt modification from “Stockholm” to “Stockholm from Kigali” yields hybrid satellite imagery, with latent features (black points) located between the Kigali and Stockholm clusters. c. Performance comparison of predicted versus true global fossil fuel carbon emissions for ResNet and DINOv3, using original data (top) and generative augmentation (bottom); a minimal sketch of this downstream use follows the figure list.
  • Extended Data Fig. 1: Modeling diagram of our framework.
  • ...and 9 more figures
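
Figure 4c reports that latent representations of urban form can support downstream prediction, such as carbon emission prediction. The following minimal sketch shows one plausible way to probe precomputed latent features with a linear regressor; the file names, array shapes, and the choice of a Ridge model are hypothetical and do not reflect the authors' evaluation protocol.

# Illustrative sketch: linear probe of latent features for carbon emission prediction.
# The feature/label files are hypothetical; features are assumed precomputed with an
# image encoder (e.g., the framework's encoder or a generic backbone such as DINOv3).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical inputs: one latent vector per satellite image patch and its emission label.
Z = np.load("latent_features.npy")   # shape: (n_samples, latent_dim)
y = np.load("carbon_emissions.npy")  # shape: (n_samples,)

Z_train, Z_test, y_train, y_test = train_test_split(Z, y, test_size=0.2, random_state=0)

# Simple linear probe on the latent space; the paper's actual predictor may differ.
model = Ridge(alpha=1.0)
model.fit(Z_train, y_train)

print("R^2 on held-out samples:", r2_score(y_test, model.predict(Z_test)))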