ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models

Zhenghang Yuan; Zhitong Xiong; Lichao Mou; Xiao Xiang Zhu

TL;DR

ChatEarthNet addresses the scarcity of high-quality, geotagged image-text data for remote sensing by pairing global Sentinel-2 imagery with WorldCover land-cover semantics and generating rich captions with two prompt designs, one for ChatGPT-3.5 and one for ChatGPT-4V. The dataset comprises 163,488 image-text pairs captioned by ChatGPT-3.5 and an additional 10,000 pairs captioned by ChatGPT-4V, with a manual verification step (inspection and correction) to improve caption accuracy. The authors report global coverage (excluding Antarctica) and diverse land-cover descriptions, and they compare caption length, vocabulary, and structure between the two models. The dataset is intended for training vision-language geo-foundation models and for evaluating large vision-language models in remote sensing; the authors state that it will be made publicly available.

Abstract

An in-depth comprehension of global land cover is essential in Earth observation, forming the foundation for a multitude of applications. Although remote sensing technology has advanced rapidly, leading to a proliferation of satellite imagery, the inherent complexity of these images often makes them difficult for non-expert users to understand. Natural language, as a carrier of human knowledge, can be a bridge between common users and complicated satellite imagery. In this context, we introduce a global-scale, high-quality image-text dataset for remote sensing, providing natural language descriptions for Sentinel-2 data to facilitate the understanding of satellite imagery for common users. Specifically, we utilize Sentinel-2 data for its global coverage as the foundational image source, employing semantic segmentation labels from the European Space Agency's (ESA) WorldCover project to enrich the descriptions of land covers. By conducting in-depth semantic analysis, we formulate detailed prompts to elicit rich descriptions from ChatGPT. To enhance the dataset's quality, we introduce a manual verification process. This step involves manual inspection and correction to refine the dataset, thus significantly improving its accuracy and quality. Finally, we offer the community ChatEarthNet, a large-scale image-text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions. ChatEarthNet consists of 163,488 image-text pairs with captions generated by ChatGPT-3.5 and an additional 10,000 image-text pairs with captions generated by ChatGPT-4V(ision). This dataset has significant potential for training vision-language geo-foundation models and evaluating large vision-language models for remote sensing. The dataset will be made publicly available.
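The abstract describes turning WorldCover segmentation labels into prompt text through semantic analysis. As a rough illustration of that idea, the sketch below computes per-class pixel fractions from a label map and folds them into a prompt. The class codes follow the ESA WorldCover legend; the function names, the prompt wording, and the toy label map are hypothetical and are not taken from the paper.

```python
from collections import Counter

# ESA WorldCover class codes and names (11-class legend).
WORLDCOVER_CLASSES = {
    10: "tree cover",
    20: "shrubland",
    30: "grassland",
    40: "cropland",
    50: "built-up",
    60: "bare / sparse vegetation",
    70: "snow and ice",
    80: "permanent water bodies",
    90: "herbaceous wetland",
    95: "mangroves",
    100: "moss and lichen",
}


def land_cover_proportions(label_map):
    """Return {class name: pixel fraction}, ordered from largest to smallest."""
    counts = Counter(code for row in label_map for code in row)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Sort by descending pixel count; break ties by class code for stability.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        WORLDCOVER_CLASSES.get(code, f"unknown ({code})"): n / total
        for code, n in ranked
    }


def build_prompt(proportions):
    """Hypothetical prompt template; NOT the wording used by the authors."""
    lines = [f"- {name}: {frac:.0%}" for name, frac in proportions.items()]
    return (
        "Describe a satellite image with the following land cover composition:\n"
        + "\n".join(lines)
    )


# Toy 4x4 label map (an assumption for illustration, not real WorldCover data).
toy_map = [
    [10, 10, 40, 40],
    [10, 10, 40, 80],
    [10, 50, 40, 80],
    [10, 10, 40, 80],
]

proportions = land_cover_proportions(toy_map)
prompt = build_prompt(proportions)
print(prompt)
```

Class proportions are only one possible semantic cue; a fuller pipeline would presumably also describe where each class sits within the image, which this sketch leaves out.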

Paper Structure (14 sections, 15 figures, 1 table, 3 algorithms)

Dataset and Methodology
Sentinel-2 Data in ChatEarthNet
Land Cover Map from WorldCover Product
Prompt Design
Prompt Design for ChatGPT-3.5
Prompt Design for ChatGPT-4V
Manual Verification
Dataset Analysis and Discussion
Dataset Overview
Geographic Coverage
Word Frequency
Caption Length
Visualization and Comparison
Conclusion

Figures (15)

Figure 1: Comparative visualization of image-text pairs across the UCM-Captions (qu2016deep), Sydney-Captions (qu2016deep), RSICD (lu2017exploring), NWPU-Captions (cheng2022nwpu), RSICap (hu2023rsgpt), RS5M (zhang2023rs5m), and SkyScript (wang2023skyscript) datasets.
Figure 2: The upper-left part of the figure displays the geographical distribution of the Sentinel-2 data used in the ChatEarthNet dataset. The lower-left part shows the temporal distribution of the Sentinel-2 data used. The right part visualizes some examples of the images and the nine spectral bands used in the dataset.
Figure 4: An overview of the ChatEarthNet dataset. We randomly select image-text samples from four different locations. The left and top sides display the descriptions generated by ChatGPT-4V, while the right and bottom sides show two samples produced by ChatGPT-3.5. We use different colors to highlight the words of different land cover types.
Figure 5: Geographical distribution of image-text pairs using ChatGPT-3.5.
Figure 6: Geographical distribution of image-text pairs using ChatGPT-4V.
...and 10 more figures
