XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery?

Fengxiang Wang, Hongzhen Wang, Mingshuo Chen, Di Wang, Yulin Wang, Zonghao Guo, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, Zhiyuan Liu, Maosong Sun

TL;DR

XLRS-Bench introduces the largest-scale, manually annotated, bilingual benchmark for evaluating multimodal large language models (MLLMs) on ultra-high-resolution remote sensing (RS) imagery. It defines 16 sub-tasks covering 10 perceptual and 6 reasoning capabilities, evaluated through VQA, detailed captioning, and visual grounding in English and Chinese, with annotation assisted by a semi-automated captioning pipeline. Experimental results show that current MLLMs struggle significantly with ultra-high-resolution RS perception and spatiotemporal reasoning, though models that accept higher-resolution input offer some gains, particularly on perception tasks. The benchmark is open-sourced to accelerate the development of RS-tailored MLLMs and foster progress toward robust, real-world RS scene understanding.

Abstract

The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution that incorporates extremely complex semantic relationships. Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. XLRS-Bench boasts the largest average image size (8500$\times$8500) observed thus far, with all evaluation samples meticulously annotated manually, assisted by a novel semi-automatic captioner on ultra-high-resolution RS images. On top of XLRS-Bench, 16 sub-tasks are defined to evaluate MLLMs' 10 kinds of perceptual capabilities and 6 kinds of reasoning capabilities, with a primary emphasis on advanced cognitive processes that facilitate real-world decision-making and the capture of spatiotemporal changes. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed for real-world RS applications. We have open-sourced XLRS-Bench to support further research in developing more powerful MLLMs for remote sensing.

Paper Structure

This paper contains 32 sections, 21 figures, 9 tables.

Figures (21)

  • Figure 1: Advantages of XLRS-Bench: XLRS-Bench boasts an average image size that is 24 times larger than existing datasets.
  • Figure 2: XLRS-Bench evaluates the perception and reasoning capabilities of MLLMs across three levels and 16 sub-tasks.
  • Figure 3: Semi-automated pipeline for detailed image captioning in XLRS-Bench.
  • Figure 4: Evaluation results of XLRS-Bench and MLLMs. “RP”, “AD”, “ECR”, “OCC”, “RC”, “CCR”, “RCCD”, “OLUC”, “RLUC”, “OSR”, “OCC”, “OCL”, and “OMS” denote the following tasks, in order: Route Planning, Anomaly Detection, Environmental Conditional Reasoning, Overall Counting, Regional Counting, Counting with Complex Reasoning, Regional Counting with Change Detection, Overall Land Use Classification, Regional Land Use Classification, Object Spatial Relationship, Object Classification, Object Color, and Object Motion State.
  • Figure 5: Example of XLRS-Bench in English. XLRS-Bench focuses on large-size ultra-high-resolution remote sensing imagery, integrating over 10 multimodal perception and reasoning tasks within the same image.
  • ...and 16 more figures