VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

Xiang Li, Jian Ding, Mohamed Elhoseiny

TL;DR

VRSBench is a large-scale, multi-task benchmark for remote sensing vision-language understanding. It addresses the limitations of prior datasets by combining detailed, human-verified captions, diverse object references, and open-ended visual question answering (VQA) in a unified framework. The authors build the annotations with a four-step pipeline (attribute extraction, prompt engineering, GPT-4V inference, and human verification) and define three evaluation tasks covering captioning, grounding, and VQA. Experiments with large vision-language models (LVLMs) and GPT-4V show substantial gains from task-specific finetuning while underscoring challenges specific to remote sensing data, such as fine-grained object details and complex spatial reasoning. The work also outlines future extensions to non-RGB modalities and emphasizes transparent data practices, reproducibility, and broad applicability in remote sensing and computer vision research.

Abstract

We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. Exploring these improvement opportunities, we present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench. This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluated state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to significantly contribute to the development of advanced vision-language models in the field of remote sensing. The data and code can be accessed at https://github.com/lx709/VRSBench.
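
To give a feel for the annotation density implied by the counts in the abstract, the short sketch below divides each total by the number of images. Only the three counts come from the abstract; the helper name `per_image` and the variable names are illustrative assumptions, not part of the VRSBench code base.

```python
# Counts quoted in the abstract.
NUM_IMAGES = 29_614
NUM_OBJECT_REFERENCES = 52_472
NUM_QA_PAIRS = 123_221


def per_image(total: int, num_images: int = NUM_IMAGES) -> float:
    """Average number of annotations per image (hypothetical helper)."""
    return total / num_images


refs_per_image = per_image(NUM_OBJECT_REFERENCES)
qa_per_image = per_image(NUM_QA_PAIRS)

print(f"object references per image: {refs_per_image:.2f}")
print(f"question-answer pairs per image: {qa_per_image:.2f}")
```

On these figures, each image carries one caption, a little under two object references, and roughly four question-answer pairs on average.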

Paper Structure

This paper contains 44 sections, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Examples of an image and corresponding annotations in the VRSBench dataset. Our annotations include object referring, visual question answering, and detailed captions.
  • Figure 2: Dataset creation pipeline. We generate object information from detection labels and use carefully designed instructions to prompt GPT-4 to generate annotations from input images along with object information. All annotations are verified by human annotators.
  • Figure 3: Statistics of the VRSBench caption dataset. (a) Probability density function (PDF) of caption length. (b) PDF of the sentence number. (c) Summative statistics.
  • Figure 4: Statistics of object referring sentences of the VRSBench dataset. (a) Distribution of the 10 most frequent object categories. (b) Distribution of the word length of referring sentences. (c) Distribution of object size. (d) Word cloud of the top 50 words in referring sentences. (e) Distribution of unique/non-unique objects in each category.
  • Figure 5: Statistics of question-answer pairs in VRSBench. (a) Distribution of question types. (b) Word cloud of top 50 most frequent words in questions. (c) Word cloud of top 50 most frequent words in answers.
  • ...and 3 more figures