CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

Xiao An; Jiaxing Sun; Zihan Gui; Wei He

TL;DR

CHOICE addresses the lack of a unified, objective benchmark for the remote-sensing capabilities of vision-language models (VLMs) by introducing a hierarchical, multi-task framework with 10,507 problems drawn from 50 cities. It defines two top-level capability dimensions (perception and reasoning) that span 6 Level-2 dimensions and 23 Level-3 leaf tasks, evaluated via multiple-choice questions (MCQs), visual-grounding coordinates, and segmentation references. Across 24 models (open-source general-domain VLMs, remote-sensing VLMs (RSVLMs), and proprietary VLMs), CHOICE shows that some general-domain VLMs achieve strong image-level understanding and common-sense reasoning, but fine-grained perception and domain-specific reasoning remain challenging, and RSVLMs do not consistently outperform general-domain models. The findings underscore the need for domain-aligned data and improved reasoning capabilities, and CHOICE provides a scalable, objective platform for future remote-sensing VLM development and benchmarking.

Abstract

Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing, have advanced rapidly and demonstrated exceptional perception and reasoning capabilities in Earth observation tasks. However, a benchmark for systematically evaluating their capabilities in this domain is still lacking. To bridge this gap, we propose CHOICE, an extensive benchmark designed to objectively evaluate the hierarchical remote sensing capabilities of VLMs. Focusing on 2 primary capability dimensions essential to remote sensing, perception and reasoning, we further define 6 secondary dimensions and 23 leaf tasks to ensure well-rounded assessment coverage. CHOICE guarantees the quality of all 10,507 problems through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control. The newly curated data and the multiple-choice question format with definitive answers allow for an objective and straightforward performance assessment. Our evaluation of 3 proprietary and 21 open-source VLMs highlights their critical limitations within this specialized context. We hope that CHOICE will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing. We will release CHOICE at [https://github.com/ShawnAn-WHU/CHOICE](https://github.com/ShawnAn-WHU/CHOICE).
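Because every multiple-choice question has one definitive answer, scoring can reduce to exact matching of option letters, rolled up along the capability hierarchy. The following is a minimal sketch of that idea, not the authors' evaluation code: the record fields (`l1`, `l2`, `l3`, `answer`, `prediction`), the function name `hierarchical_accuracy`, and the example task names are all hypothetical, and the pooled per-question averaging used here is an assumption that the paper may not share.

```python
from collections import defaultdict


def hierarchical_accuracy(records):
    """Return exact-match accuracy for every node of a three-level hierarchy.

    Each record is a dict with hypothetical keys:
    'l1', 'l2', 'l3' (hierarchy labels), 'answer', 'prediction' (option letters).
    """
    totals = defaultdict(lambda: [0, 0])  # node key -> [correct, total]
    for r in records:
        # Compare option letters, ignoring case and surrounding whitespace.
        hit = r["prediction"].strip().upper() == r["answer"].strip().upper()
        # Credit the question to its leaf task and to both ancestor levels.
        for key in ((r["l1"],), (r["l1"], r["l2"]), (r["l1"], r["l2"], r["l3"])):
            totals[key][0] += int(hit)
            totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


# Placeholder records; the task names are illustrative, not CHOICE's task list.
records = [
    {"l1": "perception", "l2": "image-level", "l3": "task-a",
     "answer": "B", "prediction": "b"},
    {"l1": "perception", "l2": "image-level", "l3": "task-a",
     "answer": "A", "prediction": "C"},
    {"l1": "reasoning", "l2": "common-sense", "l3": "task-b",
     "answer": "D", "prediction": "D "},
]
scores = hierarchical_accuracy(records)
```

Pooling questions at each level weights every question equally; averaging leaf-task accuracies instead would weight every task equally, which changes the top-level numbers whenever task sizes differ.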

Figures (15)