SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery
Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, Marc Rußwurm
TL;DR
SatCLIP addresses the need for globally robust location representations from satellite data with a contrastive pretraining objective that aligns geographic coordinates with multi-spectral Sentinel-2 imagery in a compact $d=256$-dimensional embedding space. The location encoder combines spherical-harmonic positional encodings with Sinusoidal Representation Networks (SIREN) and is paired with a frozen vision encoder (ViT16 or ResNet); a CLIP-like objective then learns a shared embedding space for coordinates and images. The authors release the S2-100K dataset and pretrained weights, and demonstrate improved performance and geographic generalization across nine diverse location-dependent tasks, including zero- and few-shot adaptation. The approach provides a scalable, globally representative, and efficient way to incorporate ground conditions into geospatial models, with potential extensions to additional modalities and to the temporal dimension.
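To make the CLIP-like objective concrete, the following is a minimal sketch of a symmetric contrastive loss over a batch of paired location and image embeddings. It is not the authors' implementation: the function names, the plain-Python list representation, and the temperature value of 0.07 are assumptions chosen for illustration, and a real training setup would use a tensor library and learned encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(loc_emb, img_emb, temperature=0.07):
    """Symmetric contrastive loss for paired location/image embeddings.

    loc_emb[i] and img_emb[i] are assumed to describe the same place.
    The temperature is a hypothetical fixed value, not taken from the paper.
    """
    loc = [l2_normalize(v) for v in loc_emb]
    img = [l2_normalize(v) for v in img_emb]
    # Cosine similarity between every location and every image, scaled.
    logits = [
        [sum(a * b for a, b in zip(l, m)) / temperature for m in img]
        for l in loc
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the location-to-image and image-to-location directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))
```

The loss is small when each location embedding is most similar to its own image embedding and large when the pairs are mismatched, which is the behavior that pulls coordinates and imagery of the same place together in the shared space.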
Abstract
Geographic information is essential for modeling tasks in fields ranging from ecology to epidemiology. However, extracting relevant location characteristics for a given task can be challenging, often requiring expensive data fusion or distillation from massive global imagery datasets. To address this challenge, we introduce Satellite Contrastive Location-Image Pretraining (SatCLIP). This global, general-purpose geographic location encoder learns an implicit representation of locations by matching CNN- and ViT-inferred visual patterns of openly available satellite imagery with their geographic coordinates. The resulting SatCLIP location encoder efficiently summarizes the characteristics of any given location for convenient use in downstream tasks. In our experiments, we use SatCLIP embeddings to improve prediction performance on nine diverse location-dependent tasks, including temperature prediction, animal recognition, and population density estimation. Across tasks, SatCLIP consistently outperforms alternative location encoders and improves geographic generalization by encoding visual similarities of spatially distant environments. These results demonstrate the potential of vision-location models to learn meaningful representations of our planet from the vast, varied, and largely untapped modalities of geospatial data.
