HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Yusen Zhang, Wenliang Zheng, Aashrith Madasu, Peng Shi, Ryo Kamoi, Hao Zhou, Zhuoyang Zou, Shu Zhao, Sarkar Snigdha Sarathi Das, Vipul Gupta, Xiaoxin Lu, Nan Zhang, Ranran Haoran Zhang, Avitej Iyer, Renze Lou, Wenpeng Yin, Rui Zhang

TL;DR

HRScene introduces a unified benchmark for high-resolution image understanding, compiling 25 real-world datasets plus 2 diagnostic datasets with resolutions spanning from $1024 \times 1024$ to $35503 \times 26627$. By evaluating 28 Vision-Language Models, the study reveals that current VLMs achieve roughly $50\%$ accuracy on real-world HRIs and exhibit distinct failure modes, notably Regional Divergence in large images and a Lost-in-the-Middle effect in diagnostic sub-tasks. The work provides a detailed data-collection and annotation protocol, a synthetic diagnostic suite to probe region usage, and extensive analyses on model size, global-local perception trade-offs, and multi-image composition. The findings establish clear directions for future research, including improved HRI processors, better region-aware reasoning, and robust handling of multi-image high-resolution inputs, with HRScene serving as an ongoing leaderboard for fair cross-model comparisons.

Abstract

High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) can allegedly handle HRIs; however, there is no comprehensive benchmark for evaluating VLMs on HRI understanding. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 $\times$ 1,024 to 35,503 $\times$ 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic to radiology images, street views, long-range pictures, and telescope images. It includes HRIs of real-world objects, scanned documents, and composite multi-image inputs. The two diagnostic evaluation datasets are synthesized by combining the target image, which contains the gold answer, with distracting images in different orders, assessing how well models utilize regions in an HRI. We conduct extensive experiments involving 28 VLMs, including Gemini 2.0 Flash and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50% on real-world tasks, revealing significant gaps in HRI understanding. Results on synthetic datasets reveal that VLMs struggle to effectively utilize HRI regions, showing significant Regional Divergence and lost-in-the-middle effects, shedding light on future research.
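The diagnostic construction described in the abstract (one target image placed among distracting images, in varying positions) can be illustrated with a minimal sketch. This is not the authors' pipeline: the square $n \times n$ grid layout, the function names `build_grid` and `all_positions`, and the use of string labels in place of pixel data are all assumptions made for illustration.

```python
import random


def build_grid(needle, distractors, n, row, col, seed=0):
    """Return an n x n grid (list of rows) with `needle` at 1-indexed (row, col).

    Hypothetical helper: cells hold labels standing in for image tiles.
    The remaining n*n - 1 cells are filled with randomly sampled distractors.
    """
    if not (1 <= row <= n and 1 <= col <= n):
        raise ValueError("row and col must lie in 1..n")
    if len(distractors) < n * n - 1:
        raise ValueError("need at least n*n - 1 distractors")
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    fillers = iter(rng.sample(distractors, n * n - 1))
    grid = []
    for r in range(1, n + 1):
        grid.append(
            [needle if (r, c) == (row, col) else next(fillers)
             for c in range(1, n + 1)]
        )
    return grid


def all_positions(n):
    """Enumerate every 1-indexed (row, col) position the needle could take."""
    return [(r, c) for r in range(1, n + 1) for c in range(1, n + 1)]


if __name__ == "__main__":
    tiles = [f"distractor_{i}" for i in range(8)]
    for row_cells in build_grid("needle", tiles, n=3, row=2, col=2):
        print(row_cells)
```

Sweeping the needle over every entry of `all_positions(n)` yields one composite sample per position, which is the kind of input needed to compare how well a model uses each region.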

Paper Structure

This paper contains 27 sections, 8 figures, 32 tables.

Figures (8)

  • Figure 1: (a) Overview taxonomy of the HRScene. (b) Performance of some VLMs on HRScene. (c) Comparison between the benchmarks that the mainstream VLMs are evaluated on and HRScene. The y-axis is the $\sqrt{\text{total pixel}}$. The boxes/icons indicate the image resolution they contain/support. The black lines inside each box show the average resolutions.
  • Figure 2: Resolution distribution of each dataset. The x-axis is the resolution, where $n$k indicates that the resolution is at least $n^2 \times 10^6$ pixels.
  • Figure 3: Some examples of HRScene. Blue ones are diagnostic datasets and purple ones are real-world datasets.
  • Figure 4: Performance of the regions averaged across all dataset points and all 18 VLMs. The x-axis is the Manhattan distance to the upper-left corner, $|x-1| + |y-1|$, where $x$ and $y$ are the row and column of the needle image; the y-axis is the performance on that sample. As the distance increases, model performance follows a U-shape, with much lower performance in the middle. As the image size increases, the U-shape becomes more pronounced.
  • Figure 5: Detailed performance of some models on the two diagnostic datasets.
  • ...and 3 more figures
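
The x-axis of Figure 4 can be reproduced from per-sample results with a few lines of code. The sketch below computes the Manhattan distance $|x-1| + |y-1|$ from the caption and averages correctness per distance. The function names, the `(row, col, correct)` record layout, and the toy records are assumptions for illustration only; they are not data from the paper.

```python
from collections import defaultdict


def manhattan_to_upper_left(x, y):
    """Distance |x-1| + |y-1| of 1-indexed cell (row x, column y) from cell (1, 1)."""
    return abs(x - 1) + abs(y - 1)


def mean_accuracy_by_distance(records):
    """Average correctness per Manhattan distance.

    `records` is an iterable of (row, col, correct) tuples, a hypothetical
    layout in which `correct` is True when the model answered correctly.
    """
    buckets = defaultdict(list)
    for x, y, correct in records:
        buckets[manhattan_to_upper_left(x, y)].append(1.0 if correct else 0.0)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}


if __name__ == "__main__":
    # Toy records invented for this example, not results from the paper.
    toy = [(1, 1, True), (2, 2, False), (1, 3, True), (3, 1, False), (3, 3, True)]
    for distance, accuracy in mean_accuracy_by_distance(toy).items():
        print(distance, round(accuracy, 3))
```

In an $n \times n$ grid the distance ranges from 0 (upper-left corner) to $2(n-1)$ (lower-right corner), so a U-shaped curve over this axis means the cells near the anti-diagonal score lowest.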