GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis

Angelos Zavras, Dimitrios Michail, Xiao Xiang Zhu, Begüm Demir, Ioannis Papoutsis

TL;DR

GAIA tackles the lack of high-quality, domain-specific image-text data for remote sensing vision-language models by creating a large, global RS dataset with 40,201 images and five synthetic captions per image (201,005 image-text pairs) collected over 25 years. The dataset combines targeted web-scraping from reputable RS sources with GPT-4o-based generation of scientifically grounded captions and rich metadata, enabling robust multi-modal learning across diverse RS modalities and resolutions. Through fine-tuning CLIP and BLIP-2 on GAIA, the authors demonstrate substantial improvements in RS image classification, cross-modal retrieval, and captioning, while preserving generalization to non-RS tasks; they also show transfer to external RS captioning benchmarks. By releasing the dataset, automated processing framework, and pretrained weights, GAIA aims to accelerate the development of RS foundation models and broaden the applicability of VLMs to Earth-observation analyses.

Abstract

Existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data and have limited exposure to the specialized domain of remote sensing (RS). This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead focus solely on attributes like date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises 201,005 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset spans the globe and covers the last 25 years, with a spatially and temporally balanced distribution of observations. GAIA's construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP-2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks. We make our dataset, automated processing framework, and fine-tuned model weights publicly available on our project's GitHub repository: https://github.com/Orion-AI-Lab/GAIA.
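
To make the dataset's size concrete, the following minimal sketch shows how 40,201 images with five captions each flatten into 201,005 image-text pairs. The record layout, the field names (`image_id`, `captions`), and the placeholder data are assumptions for illustration only; they are not the released GAIA schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator

CAPTIONS_PER_IMAGE = 5  # five synthetic captions per image, as stated above
NUM_IMAGES = 40_201     # image count given in the TL;DR


@dataclass(frozen=True)
class GaiaRecord:
    """Hypothetical record layout; field names are illustrative only."""
    image_id: str
    captions: tuple[str, ...]


def to_pairs(records: list[GaiaRecord]) -> Iterator[tuple[str, str]]:
    """Flatten each image and its captions into (image_id, caption) pairs."""
    for record in records:
        if len(record.captions) != CAPTIONS_PER_IMAGE:
            raise ValueError(
                f"{record.image_id}: expected {CAPTIONS_PER_IMAGE} captions"
            )
        for caption in record.captions:
            yield record.image_id, caption


# Placeholder records standing in for the real images and captions.
records = [
    GaiaRecord(
        image_id=f"img_{i:05d}",
        captions=tuple(
            f"caption {k} for image {i}" for k in range(CAPTIONS_PER_IMAGE)
        ),
    )
    for i in range(NUM_IMAGES)
]

pairs = list(to_pairs(records))
print(len(pairs))  # 201005
```

Keeping the five captions grouped under one image, rather than storing pairs directly, would make it straightforward to sample a single caption per image per training step, although the source does not say how the authors handle this.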

Paper Structure

This paper contains 18 sections, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Representative samples from the LAION-EO dataset [czerkawski2023laion] illustrate a fundamental limitation of web-scraped image-text datasets for remote sensing: although the images exhibit sufficient visual fidelity, the accompanying textual descriptions are noisy and lack domain-specific details, diminishing their utility for Earth Observation tasks.
  • Figure 2: A qualitative comparison of remote sensing (RS) image-text datasets, highlighting differences in caption length, number of captions per RS image, level of detail, and domain specificity. Notably, our GAIA dataset (bottom right) features interpretative, context-aware, and RS domain-specific captions, which differ from the predominantly object-level descriptions of datasets like UCM-Captions and RSICD.
  • Figure 3: Spatial coverage and distribution of the full GAIA dataset and of its train, test, and validation sets. The main figure (top) illustrates the global spatial distribution of samples in the GAIA dataset. Each red dot represents a location associated with an image-text pair. The dataset exhibits broad coverage across various regions, including the often-neglected Antarctic region, with higher concentrations over North America, Europe, and parts of Asia and South America. The three smaller maps (bottom) display the spatial distribution for the train set (orange), test set (green), and validation set (blue), respectively. This visualization demonstrates that the GAIA dataset provides a geographically diverse representation of Earth's surface, which is crucial for training and evaluating robust RS models. The train, test, and validation sets follow a spatial distribution similar to that of the overall dataset, ensuring consistency across data splits.
  • Figure 4: Overview of the GAIA data acquisition and annotation pipeline. This figure illustrates the process of building the GAIA dataset, from initial data acquisition to the generation of enriched metadata and captions. The pipeline begins with web-scraped RS articles (image + text) as the foundational data. This raw data undergoes rigorous cleaning and de-duplication, and long articles are additionally summarized using GPT-4o-mini. Subsequently, GPT-4o is employed to extract metadata from the cleaned data and generate more descriptive captions, resulting in the comprehensive GAIA dataset.
  • Figure 5: A glimpse into the GAIA dataset, showing various satellite images alongside metadata, the original alt-text and our five descriptive synthetic captions. This diverse content highlights the dataset's heterogeneity, as well as the enhanced descriptive richness achieved through our synthetic captions.
  • ...and 9 more figures