Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images

Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Hannah Kerner, Nathan Jacobs

TL;DR

Sat2Cap tackles the challenge of generating fine-grained, ground-level textual descriptions from satellite imagery in a zero-shot, weakly supervised setting. It learns a cross-view embedding space by predicting ground-level CLIP embeddings from overhead images and conditioning on temporal metadata, using a large 6.1M cross-view dataset and a memory-queue contrastive objective. The approach yields richer, temporally dynamic representations than CLIP baselines, enabling zero-shot maps for fine-grained prompts and dynamic captioning that align with ground-truth descriptions, while maintaining scalability without text labels. This framework offers a flexible, scalable pathway to semantic geospatial mapping and localization, with practical implications for large-scale textual querying and map generation.

Abstract

We propose a weakly supervised approach for creating maps using free-form textual descriptions. We refer to this work of creating textual maps as zero-shot mapping. Prior works have approached mapping tasks by developing models that predict a fixed set of attributes using overhead imagery. However, these models are very restrictive as they can only solve highly specific tasks for which they were trained. Mapping text, on the other hand, allows us to solve a large variety of mapping problems with minimal restrictions. To achieve this, we train a contrastive learning framework called Sat2Cap on a new large-scale dataset with 6.1M pairs of overhead and ground-level images. For a given location and overhead image, our model predicts the expected CLIP embeddings of the ground-level scenery. The predicted CLIP embeddings are then used to learn about the textual space associated with that location. Sat2Cap is also conditioned on date-time information, allowing it to model temporally varying concepts over a location. Our experimental results demonstrate that our models successfully capture ground-level concepts and allow large-scale mapping of fine-grained textual queries. Our approach does not require any text-labeled data, making the training easily scalable. The code, dataset, and models will be made publicly available.
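The zero-shot mapping step described above can be illustrated with a small sketch. It assumes that each location already has a predicted embedding living in CLIP space and that the text prompt has been encoded with the CLIP text encoder; the toy 3-dimensional vectors, the coordinates, and the helper names `cosine_similarity` and `zero_shot_map` are hypothetical and chosen only for illustration, not taken from the authors' code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_map(text_embedding, location_embeddings):
    """Score every location by how well its predicted embedding matches the prompt."""
    return {
        loc: cosine_similarity(text_embedding, emb)
        for loc, emb in location_embeddings.items()
    }


# Toy stand-ins (assumptions): a real system would use CLIP-sized vectors
# produced by the text encoder and by the overhead-image model.
prompt_embedding = [1.0, 0.0, 0.0]
predicted_embeddings = {
    (38.63, -90.20): [0.9, 0.1, 0.0],  # (lat, lon) -> predicted embedding
    (38.70, -90.40): [0.0, 1.0, 0.0],
}

scores = zero_shot_map(prompt_embedding, predicted_embeddings)
best_location = max(scores, key=scores.get)
```

Plotting `scores` over a dense grid of locations would give a similarity map for the prompt; how the scores are normalized or thresholded for display is left open here.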

Paper Structure (20 sections, 5 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Country-level maps of textual descriptions: (Col 1-2) shows the country-level maps created using Sat2Cap for three prompts: "Cars stuck in traffic", "People fishing on a boat," and "Farmers harvesting crops." We compare the predicted zero-shot maps with landcover maps of the region.
  • Figure 2: Sat2Cap Framework: The frozen CLIP Image Encoder takes as input the ground-level images and generates their CLIP embeddings. The trainable Sat2Cap Image Encoder takes as input the overhead images, and the Dynamic Encoder takes as input the date, time, and location information. The respective overhead image embeddings and meta-information embeddings are added element-wise, and the resulting embeddings are contrastively trained with the CLIP embeddings of the ground-level images.
  • Figure 3: Top-9 overhead-to-ground image retrieval: We use the Sat2Cap embeddings of the overhead images and CLIP embeddings of the ground-level images and show the 9 closest ground-level images retrieved for a query overhead image. The retrieval was performed from a gallery of 10,000 samples.
  • Figure 4: Silhouette value of CLIP embedding vs. Sat2Cap embedding clusters: We use k-means clustering with identical parameters to get the clusters for different values of k. Results show that Sat2Cap embeddings can be well separated into a larger number of clusters than the corresponding CLIP embeddings.
  • Figure 5: Fine-grained maps CLIP vs Sat2Cap: We create zero-shot maps on a city level using CLIP and Sat2Cap for two prompts: "Heavy trucks transporting goods" and "People walking in the streets of downtown." Compared to CLIP, Sat2Cap activations are more localized to the appropriate regions for a given prompt. Sat2Cap is better at distinguishing between fine-grained concepts like highway and downtown street.
  • ...and 9 more figures
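
The training setup described in the Figure 2 caption (element-wise addition of the overhead and meta-information embeddings, followed by contrastive training against ground-level CLIP embeddings) can be sketched roughly as below. This is a minimal illustration under stated assumptions: it uses a plain in-batch, one-directional InfoNCE loss rather than the memory-queue objective mentioned in the TL;DR, the temperature of 0.07 is an assumed value, and all function names and toy 2-dimensional vectors are hypothetical.

```python
import math


def add_elementwise(a, b):
    """Fuse two embeddings by element-wise addition, as in the Figure 2 caption."""
    return [x + y for x, y in zip(a, b)]


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy over a batch where queries[i] should match keys[i].

    The temperature is an assumption, not a value taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [sum(a * b for a, b in zip(q_i, k_j)) / temperature for k_j in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy batch of two samples (assumed values).
overhead_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # from the trainable image encoder
meta_embeddings = [[0.1, 0.0], [0.0, 0.1]]      # from the date/time/location encoder
ground_embeddings = [[1.0, 0.0], [0.0, 1.0]]    # from the frozen CLIP image encoder

fused = [add_elementwise(o, d) for o, d in zip(overhead_embeddings, meta_embeddings)]
aligned_loss = info_nce(fused, ground_embeddings)
mismatched_loss = info_nce(fused, ground_embeddings[::-1])
```

In this toy batch the loss is near zero when each fused embedding is paired with its own ground-level embedding and large when the pairing is reversed, which is the behavior a contrastive objective is meant to reward.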