A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs

Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, Yang Gao

TL;DR

RSHR-Bench introduces a large-scale, ultra-high-resolution remote sensing benchmark designed to fairly evaluate vision–language models on native RS imagery. By coupling four task families with 13 prompts and a rigorous human–LLM verification pipeline, the benchmark emphasizes faithful visual grounding and multi-turn reasoning across scenes up to ~3×10^8 pixels. Experimental results show persistent performance gaps across open-/closed-source VLMs and even strong text-only LLMs, underscoring the need for RS-specific grounding, multi-image fusion, and robust uncertainty handling. The work provides a comprehensive dataset, evaluation protocol, and prompt templates to advance remote-sensing VLM development toward real-world application needs.

Abstract

Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels and up to about 3 × 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
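The two selection ideas in the abstract (a minimum long side of 4,000 pixels, and adversarial filtering against language priors) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `MCQItem` record, the `filter_language_priors` function, and the rule "drop a question if more than `max_correct` text-only models answer it correctly" are all hypothetical choices made here for illustration, and each text-only model is represented as a plain callable rather than a real LLM client.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable

MIN_LONG_SIDE = 4000  # long-side threshold stated in the abstract

# Hypothetical signature: a text-only model sees the question and options
# (no image) and returns an option label such as "A".
TextOnlyModel = Callable[[str, tuple], str]


@dataclass(frozen=True)
class MCQItem:
    """Hypothetical record for one multiple-choice question."""
    question: str
    options: tuple
    answer: str   # gold option label, e.g. "B"
    width: int    # image width in pixels
    height: int   # image height in pixels


def is_ultra_high_res(item: MCQItem) -> bool:
    """True if the longer image side meets the 4,000-pixel threshold."""
    return max(item.width, item.height) >= MIN_LONG_SIDE


def filter_language_priors(
    items: Iterable[MCQItem],
    text_only_models: list[TextOnlyModel],
    max_correct: int = 0,
) -> list[MCQItem]:
    """Keep items that pass the resolution check and that at most
    `max_correct` text-only models answer correctly without the image.
    The threshold rule is an assumption, not taken from the paper."""
    kept = []
    for item in items:
        if not is_ultra_high_res(item):
            continue
        n_correct = sum(
            model(item.question, item.options) == item.answer
            for model in text_only_models
        )
        if n_correct <= max_correct:
            kept.append(item)
    return kept
```

With a stub model that always answers "A", a question whose gold answer is "A" would be discarded as guessable from text alone, while one whose answer is "B" would be retained for the human verification stage that the abstract describes.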

Paper Structure

This paper contains 13 sections, 1 equation, 24 figures, 11 tables.

Figures (24)

  • Figure 1: Accuracy on XLRS-Bench [Wang2025XLRSBench] and RSHR-Bench. Tasks: AD (Anomaly Detection) and ECR (Existence & Counting Reasoning). We report the average reasoning accuracy under two input settings: text-only (Llama3-8B and Qwen3-8B) and multimodal (image+text; GPT-4o and GPT-4o mini). RSHR-Bench exhibits a larger gap between text-only and multimodal settings, indicating stronger reliance on visual information.
  • Figure 1: Cases from different tasks: color detection, shape/margin recognition, orientation detection, and classification.
  • Figure 2: Overview of the RSHR-Bench construction pipeline. We collect high-resolution imagery from multiple datasets and supplement it with images from our own UAV dataset. We then generate questions, which are verified first by LLMs and then by humans. The resulting tasks fall into four main types and support a range of VLM evaluation experiments. On the right, examples of single-image understanding tasks illustrate how the benchmark is applied.
  • Figure 2: Cases from different tasks: object spatial relationship, object grounding, regional grounding, and object counting.
  • Figure 3: Overview of our benchmark composition. Left: task categories for perception and reasoning. Right: counts of tasks within the four capability groups—multiple-choice VQA (MCQ), open-ended questions (OEQ), image captioning (IC), and single-image evaluation (SIE).
  • ...and 19 more figures