Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, Kavita Bala
TL;DR
This work addresses the lack of textual annotations for remote-sensing (RS) imagery by introducing GRAFT, a Ground Remote Alignment framework that maps satellite images into CLIP's embedding space using co-located ground photos as the bridge. It yields two large-scale RS vision-language models (VLMs), at 1m and 10m resolution, trained without any text, which support zero-shot classification, retrieval, segmentation, and VQA. Training uses image-level and pixel-level alignment objectives over millions of ground-satellite pairs, and the resulting models can be combined with foundation models such as SAM and ViperGPT to extend their capabilities. On each task, the models outperform existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation, offering a practical, annotation-free path toward open-world RS understanding.
Abstract
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote-sensing images to align with the image encoder of CLIP using a large number of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision-language model (VLM) for remote-sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation, and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.
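
The alignment idea in the abstract can be illustrated with a minimal sketch. It assumes a simplified one-to-one pairing between satellite and ground embeddings and a standard InfoNCE-style contrastive loss; the abstract does not spell out the exact objective, so this is a stand-in rather than the authors' formulation. The function names (`alignment_loss`, `zero_shot_classify`), the temperature value, and the use of NumPy arrays in place of real encoder outputs are all assumptions made for illustration. Because the satellite encoder is trained to land in CLIP's embedding space, text embeddings from CLIP's text encoder could then be compared directly against satellite embeddings, which is what the second function sketches.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def alignment_loss(sat_emb, ground_emb, temperature=0.07):
    """InfoNCE-style loss pulling each satellite embedding toward its paired ground embedding.

    sat_emb:    (N, D) outputs of the trainable satellite encoder (hypothetical).
    ground_emb: (N, D) outputs of a frozen CLIP image encoder on co-located
                ground photos (hypothetical).
    Row i of both arrays is assumed to come from the same location.
    """
    sat = l2_normalize(sat_emb)
    ground = l2_normalize(ground_emb)
    logits = sat @ ground.T / temperature                # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair sits on the diagonal; maximize its log-probability.
    return float(-np.mean(np.diag(log_probs)))


def zero_shot_classify(sat_emb, text_emb):
    """Return, for each satellite embedding, the index of the most similar text embedding."""
    sims = l2_normalize(sat_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)
```

In an actual training loop, the loss would be minimized with respect to the satellite encoder's parameters only, with the CLIP encoders kept frozen so that the shared embedding space stays fixed.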
