CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

Xiao An, Jiaxing Sun, Zihan Gui, Wei He

TL;DR

CHOICE addresses the lack of a unified, objective benchmark for the remote-sensing capabilities of vision-language models by introducing a hierarchical, multi-task benchmark of 10,507 problems drawn from 50 globally distributed cities. It organizes capabilities into two top-level dimensions (perception and reasoning), subdivided into 6 Level-2 dimensions and 23 Level-3 leaf tasks, and evaluates them via MCQs, visual-grounding coordinates, and segmentation references. Across 24 models (open-source general-domain VLMs, RSVLMs, and proprietary VLMs), CHOICE shows that some general-domain VLMs achieve strong image-level understanding and common-sense reasoning, but fine-grained perception and domain-specific reasoning remain challenging, and RSVLMs do not consistently outperform general-domain models. The findings underscore the need for domain-aligned data and improved reasoning capabilities, and CHOICE provides a scalable, objective platform for future remote-sensing VLM development and benchmarking.

Abstract

Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing, have advanced rapidly and demonstrated exceptional perception and reasoning capabilities in Earth observation tasks. However, a benchmark for systematically evaluating their capabilities in this domain is still lacking. To bridge this gap, we propose CHOICE, an extensive benchmark designed to objectively evaluate the hierarchical remote sensing capabilities of VLMs. Focusing on 2 primary capability dimensions essential to remote sensing, perception and reasoning, we further define 6 secondary dimensions and 23 leaf tasks to ensure well-rounded assessment coverage. CHOICE guarantees the quality of all 10,507 problems through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control. The newly curated data and the format of multiple-choice questions with definitive answers allow for an objective and straightforward performance assessment. Our evaluation of 3 proprietary and 21 open-source VLMs highlights their critical limitations within this specialized context. We hope that CHOICE will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing. We will release CHOICE at [https://github.com/ShawnAn-WHU/CHOICE](https://github.com/ShawnAn-WHU/CHOICE).
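Because every multiple-choice problem has a single definitive answer, scoring can reduce to exact-match accuracy rolled up along the capability hierarchy. The following is a minimal sketch of that idea, not the authors' evaluation code: the record fields (`l1`, `l2`, `l3`, `answer`, `prediction`), the leaf-task names, the `hypothetical_l2` dimension, and the choice to pool all problems within a level (micro-averaging) are assumptions made for illustration.

```python
from collections import defaultdict


def hierarchical_accuracy(records):
    """Exact-match MCQ accuracy at each level of a three-level taxonomy.

    Each record is a dict with the hypothetical keys
    l1, l2, l3 (capability labels), answer, and prediction (option letters).
    """
    totals = defaultdict(lambda: [0, 0])  # (level, name) -> [correct, total]
    for r in records:
        # Compare option letters case-insensitively, ignoring whitespace.
        hit = int(r["prediction"].strip().upper() == r["answer"].strip().upper())
        for key in (("L1", r["l1"]), ("L2", r["l2"]), ("L3", r["l3"])):
            totals[key][0] += hit
            totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


# Toy records; the answers and predictions are made up, not benchmark data.
records = [
    {"l1": "perception", "l2": "image-level comprehension", "l3": "task_a",
     "answer": "A", "prediction": "A"},
    {"l1": "perception", "l2": "image-level comprehension", "l3": "task_a",
     "answer": "B", "prediction": "C"},
    {"l1": "perception", "l2": "single-instance identification", "l3": "task_b",
     "answer": "D", "prediction": " d "},
    {"l1": "reasoning", "l2": "hypothetical_l2", "l3": "task_c",
     "answer": "A", "prediction": "B"},
]

scores = hierarchical_accuracy(records)
for (level, name), acc in sorted(scores.items()):
    print(f"{level} {name}: {acc:.2f}")
```

Pooling problems within a level weights each leaf task by its problem count; averaging leaf-task accuracies instead would weight every task equally, and the two can differ when task sizes are unbalanced.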

Paper Structure

This paper contains 28 sections, 15 figures, 16 tables.

Figures (15)

  • Figure 1: (left) The hierarchical capability taxonomy of CHOICE, which concentrates on perception and reasoning capabilities and is categorized into 6 Level-2 dimensions and 23 Level-3 leaf tasks. (right) Evaluation results of the 6 Level-2 capability dimensions across mainstream VLMs. Each circle is divided into three parts by three gray lines, which, ordered by area from largest to smallest, correspond to general-domain open-source VLMs, RSVLMs, and proprietary VLMs, respectively.
  • Figure 2: Examples of CHOICE from the 6 L-2 capabilities.
  • Figure 3: Overview of the construction of CHOICE. All RSIs, sourced around the world, are collected from various satellites, products, and platforms. There are three approaches to generating the problems and a human-involved quality control process guarantees the accuracy of all the questions.
  • Figure 4: Examples of L-3 tasks within the L-2 dimension of image-level comprehension.
  • Figure 5: Examples of L-3 tasks within the L-2 dimension of single-instance identification.
  • ...and 10 more figures