GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data

Lubian Bai, Xiuyuan Zhang, Siqi Zhang, Zepeng Zhang, Haoyu Wang, Wei Qin, Shihong Du

TL;DR

GeoLink integrates OpenStreetMap (OSM) data with remote sensing (RS) foundation models by building a heterogeneous OSM graph and a cross-modal fusion pathway that produce both unimodal RS representations and multimodal hybrid encodings. It pretrains with multi-granularity signals from OSM, region-image contrastive alignment, and a masked-input scheme to accelerate learning. Downstream, GeoLink enhances RS interpretation tasks and enables comprehensive geographic tasks such as urban function zone (UFZ) segmentation and UV identification, achieving state-of-the-art results and demonstrating robustness to incomplete OSM coverage. The work underscores the value of spatially aware multimodal fusion for geospatial intelligence and lays groundwork for extending to multispectral data and more expressive spatial encodings.
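To make the idea of region-image contrastive alignment concrete, the following is a minimal sketch of a symmetric InfoNCE loss between matched image-region and OSM embeddings. It is not taken from the GeoLink code base: the function names, the temperature value of 0.07, and the toy embeddings are assumptions chosen for illustration, and the loss actually used in the paper may differ in form.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(image_embs, osm_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is a matched pair.

    `temperature` is an assumed hyperparameter, not a value from the paper.
    """
    img = [l2_normalize(v) for v in image_embs]
    osm = [l2_normalize(v) for v in osm_embs]
    n = len(img)

    # Cosine-similarity logits between every image region and OSM region.
    logits = [
        [sum(a * b for a, b in zip(img[i], osm[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-OSM and OSM-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(columns))


# Toy check: correctly paired embeddings should give a much lower loss
# than the same embeddings paired with the wrong partner.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss pulls each image region toward the OSM encoding of the same region and pushes it away from the other regions in the batch, which is the general mechanism behind contrastive alignment.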

Abstract

Integrating ground-level geospatial data with rich geographic context, like OpenStreetMap (OSM), into remote sensing (RS) foundation models (FMs) is essential for advancing geospatial intelligence and supporting a broad spectrum of tasks. However, the modality gap between RS and OSM data, including differences in data structure, content, and spatial granularity, makes effective synergy highly challenging, and most existing RS FMs focus on imagery alone. To this end, this study presents GeoLink, a multimodal framework that leverages OSM data to enhance RS FMs during both the pretraining and downstream task stages. Specifically, GeoLink enhances RS self-supervised pretraining using multi-granularity learning signals derived from OSM data, guided by cross-modal spatial correlations for information interaction and collaboration. It also introduces image mask-reconstruction to enable sparse input for efficient pretraining. For downstream tasks, GeoLink generates both unimodal and multimodal fine-grained encodings to support a wide range of applications, from common RS interpretation tasks like land cover classification to more comprehensive geographic tasks like urban function zone mapping. Extensive experiments show that incorporating OSM data during pretraining enhances the performance of the RS image encoder, while fusing RS and OSM data in downstream tasks improves the FM's adaptability to complex geographic scenarios. These results underscore the potential of multimodal synergy in advancing high-level geospatial artificial intelligence. Moreover, we find that spatial correlation plays a crucial role in enabling effective multimodal geospatial data integration. Code, checkpoints, and usage examples are released at https://github.com/bailubin/GeoLink_NeurIPS2025
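The sparse-input idea behind image mask-reconstruction can be illustrated with a short sketch: a random subset of patch indices is kept as encoder input, and the remaining patches become reconstruction targets. This is a generic masked-image-modeling sketch, not the authors' implementation; the 14 x 14 patch grid, the 0.75 mask ratio, and the helper names `random_mask` and `select_visible` are assumptions made for illustration.

```python
import random


def random_mask(num_patches, mask_ratio, seed=0):
    """Split patch indices into visible (encoder input) and masked (targets)."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_keep])
    masked = sorted(indices[num_keep:])
    return visible, masked


def select_visible(patches, visible):
    """Gather only the visible patches so the encoder sees a shorter sequence."""
    return [patches[i] for i in visible]


# Assumed setup: a 14 x 14 grid of patches with 75% of them masked.
patches = [f"patch_{i}" for i in range(14 * 14)]
visible, masked = random_mask(len(patches), mask_ratio=0.75)
encoder_input = select_visible(patches, visible)
print(len(encoder_input), len(masked))
```

Because the encoder only processes the visible subset, its sequence length (and therefore its compute cost) shrinks in proportion to the mask ratio, which is why masking can make pretraining more efficient.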

Paper Structure

This paper contains 25 sections, 3 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: (a) OSM data stores the geometry information of geographic features in vector format, including points, polylines, and polygons, and leverages tags to record the semantic information. (b) GeoLink leverages multi-granularity SSL objectives to integrate RS and OSM data across multiple spatial scales, supporting both RS interpretation tasks and comprehensive geographic tasks.
  • Figure 2: (a) GeoLink masks both modalities, using the visible image patches and the masked OSM graph as inputs. Pretraining is achieved through three SSL objectives: RS reconstruction loss, cross-modal contrastive loss, and spatial consistency loss. (b) A heterogeneous graph is employed to model OSM data, incorporating three node types and multiple spatial relationships. (c) The pretrained model can produce both unimodal and multimodal encodings, generalizing to various downstream tasks.
  • Figure 3: (a) IoU (%) for each UFZ category. (b) t-SNE visualization of the patch encodings learned by GeoLink. With the incorporation of OSM data, multimodal encodings become more compact and discriminative than unimodal ones.
  • Figure 4: Detailed structure of the OSM encoder.
  • Figure 5: Detailed structure of the object-patch fusion encoder.
  • ...and 4 more figures
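
The vector data model described in the Figure 1(a) caption (point, polyline, and polygon geometries carrying semantic tags) can be sketched as a small data structure. The class name `OSMFeature`, the helper `group_by_geometry`, and the coordinates below are hypothetical and only illustrate the description; they are not part of the GeoLink release, and the sketch does not attempt to reproduce the paper's heterogeneous graph.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)

# Minimum number of vertices needed for each geometry type.
MIN_VERTICES = {"point": 1, "polyline": 2, "polygon": 3}


@dataclass
class OSMFeature:
    """One OSM feature: vector geometry plus key-value semantic tags."""
    geometry_type: str
    coords: List[Coord]
    tags: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if self.geometry_type not in MIN_VERTICES:
            raise ValueError(f"unknown geometry type: {self.geometry_type}")
        if len(self.coords) < MIN_VERTICES[self.geometry_type]:
            raise ValueError(f"too few vertices for a {self.geometry_type}")


def group_by_geometry(features):
    """Bucket features by geometry type, e.g. before building graph nodes."""
    groups = {name: [] for name in MIN_VERTICES}
    for feature in features:
        groups[feature.geometry_type].append(feature)
    return groups


# Hypothetical coordinates; the tag keys and values follow common OSM usage.
features = [
    OSMFeature("point", [(116.31, 39.99)], {"amenity": "school"}),
    OSMFeature("polyline", [(116.30, 39.98), (116.32, 39.98)],
               {"highway": "residential"}),
    OSMFeature("polygon",
               [(116.30, 39.99), (116.31, 39.99), (116.31, 40.00)],
               {"building": "yes"}),
]
groups = group_by_geometry(features)
print({name: len(items) for name, items in groups.items()})
```

Keeping geometry and tags together in one record mirrors how OSM separates where a feature is from what it is, which is the distinction the caption draws.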