On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence

Gengchen Mai; Weiming Huang; Jin Sun; Suhang Song; Deepak Mishra; Ninghao Liu; Song Gao; Tianming Liu; Gao Cong; Yingjie Hu; Chris Cundy; Ziyuan Li; Rui Zhu; Ni Lao

TL;DR

The paper investigates opportunities and challenges for foundation models in GeoAI, arguing that multimodality is the core difficulty because geospatial tasks draw on text, images, trajectories, knowledge graphs, and vector data. It benchmarks existing large language and vision-language models on seven geospatial tasks spanning geospatial semantics, health geography, urban geography, and remote sensing, and finds strong zero-shot and few-shot performance on text-only tasks but persistent gaps on multimodal tasks. It proposes a roadmap for a multimodal GeoAI foundation model that aligns diverse modalities via geospatial footprints and geospatial knowledge graphs to enable spatial reasoning, while highlighting risks such as limited geographic fidelity and geographic bias, and the need for vector-data encoding. The work underscores the importance of geospatial-aware pretraining, data alignment, and domain-specific evaluation for responsibly advancing GeoAI foundation models with real-world impact.

Abstract

Large pre-trained models, also known as foundation models (FMs), are trained in a task-agnostic manner on large-scale data and can be adapted to a wide range of downstream tasks by fine-tuning, few-shot, or even zero-shot learning. Despite their successes in language and vision tasks, we have yet to see an attempt to develop foundation models for geospatial artificial intelligence (GeoAI). In this work, we explore the promises and challenges of developing multimodal foundation models for GeoAI. We first investigate the potential of many existing FMs by testing their performance on seven tasks across multiple geospatial subdomains, including Geospatial Semantics, Health Geography, Urban Geography, and Remote Sensing. Our results indicate that on several geospatial tasks that involve only the text modality, such as toponym recognition, location description recognition, and US state-level/county-level dementia time series forecasting, these task-agnostic LLMs can outperform task-specific fully-supervised models in a zero-shot or few-shot learning setting. However, on other geospatial tasks, especially tasks that involve multiple data modalities (e.g., POI-based urban function classification, street view image-based urban noise intensity classification, and remote sensing image scene classification), existing foundation models still underperform task-specific models. Based on these observations, we propose that one of the major challenges of developing an FM for GeoAI is to address the multimodal nature of geospatial tasks. After discussing the distinct challenges of each geospatial data modality, we suggest the possibility of a multimodal foundation model that can reason over various types of geospatial data through geospatial alignments. We conclude this paper by discussing the unique risks and challenges of developing such a model for GeoAI.

Paper Structure (33 sections, 13 figures, 7 tables)

Introduction
Related Work
Language Foundation Model
Vision Foundation Model
Multimodal Foundation Model
Exploration of the Effectiveness of Existing FMs on Various Geospatial Domains
Geospatial Semantics
Toponym Recognition
Location Description Recognition
Health Geography
US State-Level Dementia Time Series Forecasting
US County-Level Dementia Time Series Forecasting
Urban Geography
POI-Based Urban Function Classification
Street View Image-Based Urban Noise Intensity Classification
...and 18 more sections

Figures (13)

Figure 1: Prediction error maps of each baseline and GPT model on the US county-level dementia death count time series forecasting task. The color of each US county indicates the percentage error $PE = (Prediction - Label)/Label$ of each model's prediction for that county. Counties shown in gray are those whose dementia data between 1999 and 2020 are not available.
Figure 2: The spatial distributions of POI data in the UrbanPOI5K dataset.
Figure 3: Confusion matrices of Place2Vec and HGI (Group A in Table \ref{tab:exp_urbanpoi_eval}) on the UrbanPOI5K dataset.
Figure 4: Confusion matrices of all GPT models (Group B in Table \ref{tab:exp_urbanpoi_eval}) on the UrbanPOI5K dataset under the zero-shot setting.
Figure 5: Confusion matrices of all GPT models (Group C in Table \ref{tab:exp_urbanpoi_eval}) on the UrbanPOI5K dataset under the one-shot setting.
...and 8 more figures
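
The percentage error used in the Figure 1 caption, $PE = (Prediction - Label)/Label$, can be illustrated with a minimal sketch. The function name and the numeric values below are hypothetical and are not taken from the paper; the sketch only shows how the sign of PE distinguishes over-prediction from under-prediction.

```python
def percentage_error(prediction: float, label: float) -> float:
    """Return PE = (prediction - label) / label.

    PE is undefined when the label is zero, so that case is rejected.
    """
    if label == 0:
        raise ValueError("label must be non-zero")
    return (prediction - label) / label


# Hypothetical county-level death counts, for illustration only.
pe_over = percentage_error(120.0, 100.0)   # over-prediction gives a positive PE
pe_under = percentage_error(80.0, 100.0)   # under-prediction gives a negative PE
```

A positive PE means the model predicted more deaths than were recorded, and a negative PE means it predicted fewer, which is what the map colors in Figure 1 encode.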
