HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

Yusen Zhang; Wenliang Zheng; Aashrith Madasu; Peng Shi; Ryo Kamoi; Hao Zhou; Zhuoyang Zou; Shu Zhao; Sarkar Snigdha Sarathi Das; Vipul Gupta; Xiaoxin Lu; Nan Zhang; Ranran Haoran Zhang; Avitej Iyer; Renze Lou; Wenpeng Yin; Rui Zhang

TL;DR

HRScene introduces a unified benchmark for high-resolution image understanding, compiling 25 real-world datasets plus 2 diagnostic datasets with resolutions spanning from $1024 \times 1024$ to $35503 \times 26627$. By evaluating 28 Vision-Language Models, the study reveals that current VLMs achieve roughly $50\%$ accuracy on real-world HRIs and exhibit distinct failure modes, notably Regional Divergence in large images and a Lost-in-the-Middle effect in diagnostic sub-tasks. The work provides a detailed data-collection and annotation protocol, a synthetic diagnostic suite to probe region usage, and extensive analyses on model size, global-local perception trade-offs, and multi-image composition. The findings establish clear directions for future research, including improved HRI processors, better region-aware reasoning, and robust handling of multi-image high-resolution inputs, with HRScene serving as an ongoing leaderboard for fair cross-model comparisons.

Abstract

High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) can allegedly handle HRIs; however, there is no comprehensive benchmark for evaluating VLMs on HRI understanding. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 $\times$ 1,024 to 35,503 $\times$ 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators and covers 25 scenarios, ranging from microscopic to radiology images, street views, long-range pictures, and telescope images. It includes HRIs of real-world objects, scanned documents, and composite multi-image inputs. The two diagnostic evaluation datasets are synthesized by combining the target image with the gold answer and distracting images in different orders, assessing how well models utilize regions in an HRI. We conduct extensive experiments involving 28 VLMs, including Gemini 2.0 Flash and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50% on real-world tasks, revealing significant gaps in HRI understanding. Results on the synthetic datasets reveal that VLMs struggle to effectively utilize HRI regions, showing significant Regional Divergence and lost-in-the-middle effects, which sheds light on directions for future research.
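The diagnostic construction described above, in which a target image is placed among distracting images in different orders, can be illustrated with a small sketch. This is not the authors' pipeline: the square grid layout, the function names (`build_grid`, `all_placements`, `accuracy_by_position`), and the per-cell accuracy summary are all assumptions made for illustration, and the paper's own definition of Regional Divergence is not given in this abstract.

```python
import random


def build_grid(target, distractors, n, target_pos, seed=0):
    """Return an n x n grid (list of rows) of image ids with `target` at target_pos.

    Hypothetical helper: the n x n layout is an assumption, not the paper's spec.
    """
    row, col = target_pos
    if not (0 <= row < n and 0 <= col < n):
        raise ValueError("target_pos must lie inside the grid")
    if len(distractors) < n * n - 1:
        raise ValueError("need at least n*n - 1 distractors")
    rng = random.Random(seed)
    # Pick the distractors that fill every cell except the target's cell.
    fill = rng.sample(distractors, n * n - 1)
    index = row * n + col
    cells = fill[:index] + [target] + fill[index:]
    return [cells[r * n:(r + 1) * n] for r in range(n)]


def all_placements(target, distractors, n, seed=0):
    """Yield ((row, col), grid) for every cell the target can occupy."""
    for row in range(n):
        for col in range(n):
            yield (row, col), build_grid(target, distractors, n, (row, col), seed)


def accuracy_by_position(results, n):
    """Aggregate ((row, col), correct) pairs into an n x n grid of accuracies.

    Cells with no samples are reported as None.
    """
    hits = [[0] * n for _ in range(n)]
    totals = [[0] * n for _ in range(n)]
    for (row, col), correct in results:
        totals[row][col] += 1
        hits[row][col] += int(correct)
    return [
        [hits[r][c] / totals[r][c] if totals[r][c] else None for c in range(n)]
        for r in range(n)
    ]
```

Under these assumptions, running a model once per placement and passing the outcomes to `accuracy_by_position` gives a per-cell accuracy map. Uneven values across cells would correspond to the kind of position sensitivity the abstract reports.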
