GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Angelos Zavras, Dimitrios Michail, Xiao Xiang Zhu, Begüm Demir, Ioannis Papoutsis
TL;DR
GAIA tackles the lack of high-quality, domain-specific image-text data for remote sensing (RS) vision-language models (VLMs) by creating a large, global RS dataset with 40,201 images and five synthetic captions per image (201,005 image-text pairs), spanning 25 years of observations. The dataset combines targeted web-scraping from reputable RS sources with GPT-4o-based generation of scientifically grounded captions and rich metadata, enabling robust multi-modal learning across diverse RS modalities and resolutions. By fine-tuning CLIP and BLIP-2 on GAIA, the authors demonstrate substantial improvements in RS image classification, cross-modal retrieval, and captioning, while preserving generalization to non-RS tasks; they also show transfer to external RS captioning benchmarks. By releasing the dataset, automated processing framework, and fine-tuned model weights, GAIA aims to accelerate the development of RS foundation models and broaden the applicability of VLMs to Earth-observation analyses.
Abstract
Existing Vision-Language Models (VLMs) are predominantly trained on noisy, web-scraped image-text data and have limited exposure to the specialized domain of Remote Sensing (RS). This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead focus solely on attributes such as date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 201,005 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset is spatially and temporally balanced, spanning the globe and covering the last 25 years. GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP-2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks. We make our dataset, automated processing framework, and fine-tuned model weights publicly available on our project's GitHub repository: https://github.com/Orion-AI-Lab/GAIA.
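
The relationship between images, captions, and image-text pairs can be illustrated with a minimal Python sketch. It only reproduces the arithmetic stated above (40,201 images in the TL;DR, five captions each, 201,005 pairs). The `ImageRecord` class, its field names, and the placeholder captions are hypothetical and are not taken from the GAIA repository or its actual data schema.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple

NUM_IMAGES = 40_201       # image count stated in the TL;DR
CAPTIONS_PER_IMAGE = 5    # synthetic captions per image, as stated in the abstract


@dataclass
class ImageRecord:
    """Hypothetical record layout; not the actual GAIA schema."""
    image_id: str
    captions: List[str]


def iter_pairs(records: Iterable[ImageRecord]) -> Iterator[Tuple[str, str]]:
    """Flatten each image and its captions into (image_id, caption) pairs."""
    for record in records:
        if len(record.captions) != CAPTIONS_PER_IMAGE:
            raise ValueError(
                f"{record.image_id}: expected {CAPTIONS_PER_IMAGE} captions, "
                f"got {len(record.captions)}"
            )
        for caption in record.captions:
            yield record.image_id, caption


# Placeholder records standing in for the real images and captions.
records = [
    ImageRecord(
        image_id=f"img_{i:05d}",
        captions=[f"placeholder caption {j}" for j in range(CAPTIONS_PER_IMAGE)],
    )
    for i in range(NUM_IMAGES)
]

num_pairs = sum(1 for _ in iter_pairs(records))
print(num_pairs)  # 201005
```

Flattening to one pair per caption is only one plausible way to feed such data to contrastive or captioning fine-tuning; the source does not specify how the released framework structures its training samples.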
