SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery

Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, Marc Rußwurm

TL;DR

SatCLIP addresses the need for globally robust location representations derived from satellite data. It uses contrastive pretraining to align geographic coordinates with multi-spectral Sentinel-2 imagery in a compact $d=256$-dimensional embedding space. The location encoder combines spherical-harmonic location encodings with Sinusoidal Representation Networks, and a frozen vision encoder (ViT16/ResNet) supplies the image features, so that a CLIP-like objective learns a shared embedding space for coordinates and images. The authors release the S2-100K dataset and pretrained weights, and demonstrate improved performance and geographic generalization across nine diverse location-dependent tasks, including zero- and few-shot adaptation. The approach provides a scalable, globally representative, and efficient way to incorporate ground conditions into geospatial models, with potential extensions to multiple modalities and time.
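The CLIP-like objective mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the encoders are replaced by random vectors, the function name `clip_style_loss` is hypothetical, and the temperature of 0.07 is an assumed value. Only the embedding size $d=256$ comes from the text. The sketch computes a symmetric cross-entropy over a batch in which the i-th location embedding should match the i-th image embedding.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(loc_emb, img_emb, temperature=0.07):
    """Symmetric contrastive loss over matched (location, image) pairs.

    temperature=0.07 is an assumption, not a value taken from the paper.
    """
    loc = [l2_normalize(v) for v in loc_emb]
    img = [l2_normalize(v) for v in img_emb]
    n = len(loc)
    # logits[i][j]: scaled cosine similarity of location i and image j
    logits = [
        [sum(a * b for a, b in zip(loc[i], img[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Location -> image direction: row i should select column i.
    loss_l2i = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Image -> location direction: column j should select row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_i2l = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_l2i + loss_i2l)


random.seed(0)
d, batch = 256, 8  # d = 256 follows the text; the batch size is arbitrary
img = [[random.gauss(0, 1) for _ in range(d)] for _ in range(batch)]

aligned = clip_style_loss(img, img)                 # perfectly matched pairs
shuffled = clip_style_loss(img[1:] + img[:1], img)  # every pair mismatched
```

In this toy setup the matched batch yields a much lower loss than the shuffled one, which is the signal a contrastive objective of this kind uses to pull each coordinate embedding toward the embedding of the image taken at that location.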

Abstract

Geographic information is essential for modeling tasks in fields ranging from ecology to epidemiology. However, extracting relevant location characteristics for a given task can be challenging, often requiring expensive data fusion or distillation from massive global imagery datasets. To address this challenge, we introduce Satellite Contrastive Location-Image Pretraining (SatCLIP). This global, general-purpose geographic location encoder learns an implicit representation of locations by matching CNN- and ViT-inferred visual patterns of openly available satellite imagery with their geographic coordinates. The resulting SatCLIP location encoder efficiently summarizes the characteristics of any given location for convenient use in downstream tasks. In our experiments, we use SatCLIP embeddings to improve prediction performance on nine diverse location-dependent tasks including temperature prediction, animal recognition, and population density estimation. Across tasks, SatCLIP consistently outperforms alternative location encoders and improves geographic generalization by encoding visual similarities of spatially distant environments. These results demonstrate the potential of vision-location models to learn meaningful representations of our planet from the vast, varied, and largely untapped modalities of geospatial data.

Paper Structure (39 sections, 2 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: Motivation for SatCLIP: Capturing ground conditions from satellite images and transferring them into a location encoder via contrastive image-location pretraining. The right globe shows a PCA representation of the pre-trained location encoder.
  • Figure 2: The SatCLIP pretraining and deployment pipeline. SatCLIP pretraining through image-location matching is outlined on the left. The pretrained location encoder can then be used in downstream tasks, highlighted on the right.
  • Figure 3: Spatial distribution of the S2-100K dataset used for training SatCLIP compared with iNaturalist 2018 [Horn2018] and MP-16 [Larson2017], which are used to pretrain CSP and GeoCLIP models. iNaturalist and MP-16 heavily overrepresent North America and Europe.
  • Figure 4: Performance metrics aggregated by continent highlight how location embeddings perform in different geographic areas for population density estimation and biome classification for five continents. $L=40$ SatCLIP models are shown.
  • Figure 5: Geographic adaptation: predictions of Ecoregions for Africa. SatCLIP with $L=10$ maps Ecoregions in Africa closest to the ground truth, followed by GeoCLIP. MOSAIKS provides predictions that are too fine-grained, and CSP-iNat is too coarse. Locations marked "X" in the "True" panel show the sparse training locations in Africa, which are on average 480 km from their nearest neighbor.
  • ...and 8 more figures