Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, Kavita Bala

TL;DR

This work tackles the lack of textual annotations in remote-sensing (RS) vision-language modeling by introducing GRAFT, a Ground Remote Alignment framework that aligns a satellite image encoder with CLIP's image encoder using co-located ground photos. It yields two large-scale RS VLMs (at 1 m and 10 m resolution) trained without text that support zero-shot classification, retrieval, segmentation, and VQA. The approach trains image-level and pixel-level alignments on millions of ground-satellite image pairs, and extends the models' capabilities by combining them with foundation models such as SAM and ViperGPT. The resulting models outperform VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation, offering a practical, annotation-free path toward open-world RS understanding.

Abstract

We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision language model (VLM) for remote sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.
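The alignment idea in the abstract can be illustrated with a minimal sketch. It assumes a generic InfoNCE-style contrastive objective in which each ground-photo embedding (from a frozen CLIP image encoder) should be closest to the embedding of its co-located satellite image; the exact loss used in the paper is not stated in this excerpt, so the function names, the `temperature` value, and the `sat_index` pairing array are illustrative assumptions. Because the satellite embeddings end up in CLIP's space, zero-shot classification reduces to a nearest-text-embedding lookup.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def alignment_loss(sat_emb, ground_emb, sat_index, temperature=0.07):
    """Contrastive alignment loss (assumed form, not taken from the paper).

    sat_emb:    (S, D) outputs of the trainable satellite image encoder.
    ground_emb: (G, D) frozen CLIP image embeddings of ground photos.
    sat_index:  (G,) ints; sat_index[j] is the satellite image co-located
                with ground photo j (several photos may share one image).
    """
    sat = l2_normalize(sat_emb)
    ground = l2_normalize(ground_emb)
    logits = ground @ sat.T / temperature            # (G, S) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    rows = np.arange(len(sat_index))
    return float(-log_probs[rows, sat_index].mean())

def zero_shot_classify(sat_emb, text_emb):
    """Pick, for each satellite embedding, the most similar text embedding.

    text_emb: (C, D) CLIP text embeddings of the class prompts.
    """
    sims = l2_normalize(sat_emb) @ l2_normalize(text_emb).T
    return np.argmax(sims, axis=1)
```

In this sketch only `sat_emb` would receive gradients during training; the CLIP image and text encoders stay fixed, which is what lets text queries be used at test time without any text during training.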

Paper Structure

This paper contains 46 sections, 5 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Zero-shot features of our model. GRAFT can perform image retrieval with open-world queries, and zero-shot classification for satellite images. Using other foundational models, we extend it to also perform semantic segmentation and zero-shot VQA. (Please view digitally to see details.)
  • Figure 2: Training image-level VLM (left) and pixel-level VLM (right) with GRAFT. Note that for each satellite image there can be multiple ground images such as $\{g_1^1, g_1^2\}$ for $s_1$.
  • Figure 3: Frequency histogram of locations of samples in our internet image-NAIP image pair dataset (left) and in our internet image-Sentinel-2 image pair dataset (right).
  • Figure 4: Density maps produced using open-world queries for cities, roads, and farmlands using our method (darker blue means higher density). The right-most map shows the true agricultural land use pattern. Our map matches with the ground truth.
  • Figure 5: Top retrievals of GRAFT when finding dynamic objects such as cargo ships or airplanes. Images with green box show a successful retrieval of dynamic objects.
  • ...and 5 more figures