A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs

Yunkai Dang; Meiyi Zhu; Donghao Wang; Yizhuo Zhang; Jiacheng Yang; Qi Fan; Yuekun Yang; Wenbin Li; Feng Miao; Yang Gao

TL;DR

RSHR-Bench is a large-scale, ultra-high-resolution remote sensing benchmark designed to faithfully evaluate vision–language models on native RS imagery. By coupling four task families with 13 prompts and a rigorous human–LLM verification pipeline, the benchmark emphasizes faithful visual grounding and multi-turn reasoning across scenes of up to about 3 × 10^8 pixels. Experimental results show persistent performance gaps across open-source and closed-source VLMs and even strong text-only LLMs, underscoring the need for RS-specific grounding, multi-image fusion, and robust uncertainty handling. The work provides a comprehensive dataset, evaluation protocol, and prompt templates to move remote-sensing VLM development closer to real-world application needs.

Abstract

Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 × 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
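
The adversarial filtering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the item layout, the `blind_hit_rate` and `adversarial_filter` helpers, and the `max_hit_rate` threshold are all assumptions made for illustration. The idea shown is simply to drop multiple-choice questions that text-only models answer correctly without seeing the image, leaving the remainder for human verification.

```python
from typing import Any, Callable, Dict, List

# Hypothetical item layout (an assumption, not the RSHR-Bench schema):
# {"question": str, "options": List[str], "answer": str}
Item = Dict[str, Any]
# A "blind" model receives only the question text and options, never the image.
BlindModel = Callable[[str, List[str]], str]


def blind_hit_rate(item: Item, models: List[BlindModel]) -> float:
    """Fraction of text-only models that pick the correct option."""
    if not models:
        return 0.0
    hits = sum(
        1 for model in models
        if model(item["question"], item["options"]) == item["answer"]
    )
    return hits / len(models)


def adversarial_filter(
    items: List[Item],
    models: List[BlindModel],
    max_hit_rate: float = 0.0,  # assumed threshold: drop if any blind model is right
) -> List[Item]:
    """Keep only items whose blind hit rate does not exceed max_hit_rate."""
    return [it for it in items if blind_hit_rate(it, models) <= max_hit_rate]


if __name__ == "__main__":
    # Stub standing in for a strong text-only LLM: always picks the first option.
    always_first: BlindModel = lambda question, options: options[0]
    items = [
        {"question": "q1", "options": ["A", "B"], "answer": "A"},
        {"question": "q2", "options": ["A", "B"], "answer": "B"},
    ]
    kept = adversarial_filter(items, [always_first])
    print([it["question"] for it in kept])
```

In practice each `BlindModel` would wrap a call to a text-only LLM, and a looser `max_hit_rate` would trade benchmark size against resistance to language priors.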
