HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, Xiaodan Liang

TL;DR

HiRes-LLaVA tackles the fragmentation problem that input slicing causes in high-resolution LVLMs by introducing two core components: the SliceRestore Adapter (SRA) and the Self-Mining Sampler (SMS). The architecture uses a dual-branch vision encoder to capture global context from a low-resolution overview while preserving fine-grained detail from high-resolution slices, with SRA restoring cross-patch information and SMS compressing tokens efficiently. The authors also introduce EntityGrid-QA, a synthetic benchmark focused on edge vs. center fragmentation that quantifies model robustness to slice boundaries. Across nine public benchmarks and EntityGrid-QA, HiRes-LLaVA achieves strong performance, especially on document-oriented tasks, while reducing training overhead relative to baselines. Collectively, the work offers a practical path toward robust high-resolution LVLMs and provides a targeted benchmark for evaluating fragmentation handling.

Abstract

High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities. To reduce the training and computation costs caused by high-resolution input, one promising direction is to use sliding windows to slice the input into uniform patches, each matching the input size of the well-trained vision encoder. Although efficient, this slicing strategy leads to fragmentation of the original input, i.e., the continuity of contextual information and spatial geometry is lost across patches, adversely affecting performance in cross-patch context perception and position-specific tasks. To overcome these shortcomings, we introduce HiRes-LLaVA, a novel framework designed to efficiently process high-resolution input of any size without altering the original contextual and geometric information. HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler that compresses the vision tokens based on themselves, preserving the original context and positional information while reducing training overhead. To assess the ability to handle context fragmentation, we construct a new benchmark, EntityGrid-QA, consisting of edge-related and position-related tasks. Our comprehensive experiments demonstrate the superiority of HiRes-LLaVA on both existing public benchmarks and EntityGrid-QA, particularly on document-oriented tasks, establishing new standards for handling high-resolution inputs.
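
The slicing step and its inverse can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `slice_image` and `restore_image` are hypothetical, the slices are non-overlapping, and the input is assumed to be already resized or padded so that its height and width are multiples of the slice size. It only shows how a pixel or feature map is cut into uniform slices and how the slices can be stitched back into the original layout, which is the operation a restore-style module relies on before it can look across slice boundaries.

```python
import numpy as np


def slice_image(image, tile):
    """Split an (H, W, C) array into non-overlapping (tile, tile, C) slices.

    Assumption: H and W are multiples of `tile`; a real pipeline would
    resize or pad the input first.
    Returns the slices in row-major order plus the (rows, cols) grid shape.
    """
    h, w, c = image.shape
    if h % tile or w % tile:
        raise ValueError("height and width must be multiples of tile")
    rows, cols = h // tile, w // tile
    slices = (
        image.reshape(rows, tile, cols, tile, c)
        .transpose(0, 2, 1, 3, 4)  # -> (rows, cols, tile, tile, C)
        .reshape(rows * cols, tile, tile, c)
    )
    return slices, (rows, cols)


def restore_image(slices, grid):
    """Stitch row-major slices back into the original (H, W, C) layout."""
    rows, cols = grid
    _, tile, _, c = slices.shape
    return (
        slices.reshape(rows, cols, tile, tile, c)
        .transpose(0, 2, 1, 3, 4)  # -> (rows, tile, cols, tile, C)
        .reshape(rows * tile, cols * tile, c)
    )
```

For a 4x6 single-channel input and a slice size of 2, `slice_image` returns six 2x2 slices on a 2x3 grid, and `restore_image` reproduces the input exactly. Any object that straddles a column or row boundary between slices is split across them, which is the fragmentation the abstract describes.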

Paper Structure (23 sections, 6 equations, 9 figures, 14 tables)

Figures (9)

  • Figure 1: Illustration of the fragmentation issue. (a) Slicing input: Slicing-based LVLMs, such as LLaVA-Next [liu2024llavanext], can fragment objects located at the edges of slices, leading to errors in model understanding. (b) Performance comparison: On our EntityGrid-QA benchmark, slicing-based methods show a significant performance gap between fragment and non-fragment inputs. Our method effectively handles both cases, achieving a smaller performance gap similar to that of non-slicing approaches.
  • Figure 2: Overall framework of HiRes-LLaVA. The vision encoding consists of two branches: one for low-resolution images processed by the pre-trained vision encoder to extract global features, and another dividing high-resolution images into multiple slices to capture fine-grained details. (a) The SliceRestore Adapter addresses the context fragmentation issue: it restores sliced features into a whole feature by capturing both local and global information, then splits the whole feature back into slices. (b) The Self-Mining Sampler compresses the number of visual tokens to reduce computation and memory costs by using downsampled features as queries and the original features as keys and values. Both the low-resolution image input and each high-resolution slice are compressed by the same self-mining sampler.
  • Figure 3: Construction process of the EntityGrid-QA benchmark. There are three steps: (a) Entity Sampling. Select one or two entities from the pre-defined entity set. (b) Image Generation. Place the selected entities at one position sampled from the nine pre-defined positions of the blank image to obtain the generated images. Note that the dashed and solid lines in (b) are for illustration purposes only and are not presented to models. (c) QA Pairs Generation. Based on the generated images, entity categories and positions, the question-answer pairs (QAs) are generated automatically.
  • Figure 4: Visual comparison with state-of-the-art methods. Dashed lines are shown only to clarify the slice boundaries.
  • Figure 5: (a) Ablation on data efficiency of HiRes-LLaVA. We sample the training data mixture at ratios of 20%, 60%, and 100% and report the performance of our HiRes-LLaVA on seven benchmarks. (b) Data efficiency comparison between Q-former and our proposed self-mining sampler (SMS). The performance on 'Doc QA' is averaged over DocVQA, ChartQA and InfoVQA. The performance on 'General QA' is averaged over the other four benchmarks. Our SMS can use 40% less data to achieve competitive performance compared with Q-former, indicating our method's efficiency. Note that both Q-former and our SMS apply one cross-attention block.
  • ...and 4 more figures
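
The token compression described in the Figure 2 caption (downsampled features as queries, original features as keys and values) can be sketched as a single cross-attention step. The NumPy code below is a simplified illustration under several assumptions that the captions do not confirm: the downsampling is average pooling over a 2D token grid, attention is single-head, and there are no learned projection weights. The names `average_pool_tokens` and `self_mining_sample` are hypothetical.

```python
import numpy as np


def average_pool_tokens(features, grid, factor):
    """Average-pool (N, D) tokens laid out row-major on a (gh, gw) grid.

    Assumption: average pooling stands in for the downsampling step.
    Returns ((gh // factor) * (gw // factor), D) pooled tokens.
    """
    gh, gw = grid
    n, d = features.shape
    if n != gh * gw:
        raise ValueError("token count must equal gh * gw")
    if gh % factor or gw % factor:
        raise ValueError("grid sides must be multiples of factor")
    x = features.reshape(gh // factor, factor, gw // factor, factor, d)
    return x.mean(axis=(1, 3)).reshape(-1, d)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_mining_sample(features, grid, factor):
    """Compress tokens: pooled tokens query the original tokens.

    Queries come from the pooled features; keys and values are the
    original features. No learned projections are used in this sketch.
    """
    queries = average_pool_tokens(features, grid, factor)
    d = features.shape[1]
    scores = queries @ features.T / np.sqrt(d)  # (M, N) similarities
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ features                   # (M, D) compressed tokens
```

With 16 tokens on a 4x4 grid and `factor=2`, the sketch returns 4 compressed tokens, each a weighted average of the original 16. Because the queries are derived from the features themselves rather than from a separate set of learned query vectors, each output token stays tied to a spatial region of the input, which matches the caption's description of compressing tokens "based on themselves".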