HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng

TL;DR

The paper reveals that background interference is the primary bottleneck limiting high-resolution perception in multimodal LLMs. It proposes HiDe, a training-free framework comprising Token-wise Attention Decoupling (TAD) to purify key-information attention and Layout-Preserving Decoupling (LPD) to extract and reconstruct target regions while preserving spatial relations. HiDe achieves state-of-the-art results on HR-VQA benchmarks (V*Bench, HRBench4K, HRBench8K) with representative MLLMs and substantially reduces memory usage, addressing practical deployment concerns. The approach offers a scalable, plug-and-play enhancement to existing MLLMs, improving fine-grained visual understanding without costly fine-tuning or multi-pass reasoning. Overall, HiDe provides a principled, efficient solution for precise high-resolution visual reasoning in MLLMs, with broad implications for HR-VQA and related tasks.

Abstract

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints, arguing that MLLMs struggle to recognize small objects, and therefore use "zoom in" strategies to recover detail. Our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstruct a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, respectively, even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://github.com/Tennine2077/HiDe.
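
The two stages described in the abstract can be illustrated with a toy sketch on a patch grid. This is not the authors' implementation: the function names (`token_attention_mask`, `layout_preserving_compact`), the per-token normalization, the max aggregation, and the relative threshold are all assumptions chosen only to show the idea of selecting regions from per-token attention and then dropping empty rows and columns so that the surviving regions keep their relative order.

```python
import numpy as np

def token_attention_mask(attn, key_token_ids, rel_threshold=0.5):
    """Hypothetical stand-in for the TAD step.

    attn: array of shape (num_question_tokens, H, W) holding the attention
    of each question token over an H x W grid of visual patches.
    Returns a boolean (H, W) mask of patches attended by any key token.
    """
    maps = attn[key_token_ids].astype(float)                 # (K, H, W)
    maps = maps / maps.sum(axis=(1, 2), keepdims=True)       # normalize per token
    combined = maps.max(axis=0)                              # strongest key token per patch
    return combined >= rel_threshold * combined.max()

def layout_preserving_compact(grid, mask):
    """Hypothetical stand-in for the LPD step.

    Keeps only the rows and columns that contain a selected patch and
    blanks the unselected patches that remain, so selected regions stay
    in the same relative order (left/right, above/below).
    """
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    compact = grid[np.ix_(rows, cols)].copy()
    compact[~mask[np.ix_(rows, cols)]] = 0                   # remove background patches
    return compact, rows, cols

# Toy input: token 1 attends to patch (1, 1), token 2 to patch (4, 5).
attn = np.full((3, 6, 6), 0.01)
attn[1, 1, 1] = 1.0
attn[2, 4, 5] = 1.0
grid = np.arange(36).reshape(6, 6)                           # patch ids as a fake image

mask = token_attention_mask(attn, key_token_ids=[1, 2])
compact, rows, cols = layout_preserving_compact(grid, mask)
```

In this toy case the 6 x 6 grid shrinks to 2 x 2, with the first target still above and to the left of the second, which is the property the abstract refers to as preserving spatial layout.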

Paper Structure

This paper contains 26 sections, 8 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: (a) Previous methods struggle to locate objects. (b) HiDe precisely locates objects and preserves their relative positions. (c) HiDe outperforms previous training-free methods and beats the trained one.
  • Figure 2: Hierarchical decoupling framework for analyzing MLLM performance on high-resolution images. Details of the gray blocks are shown in Fig. \ref{fig:accuracy_vs_upscale_factor}, the blue blocks in Fig. \ref{fig:accuracy_comparison_vs_crop_out}, the red blocks in Fig. \ref{fig:analysis}, and the green blocks in Fig. \ref{fig:method}.
  • Figure 3: Left: A contradictory example comparing the inference results of zoom-in and simple resolution upscaling at the same upscale factor. Right: Performance curves showing the impact of resolution scaling on two models across two tasks—Attributes for single-object tasks and Spatial for multi-object tasks.
  • Figure 4: Background Information Ablation Experiments. Left: Model accuracy increases as the mask ratio of background semantic information rises. Right: Model accuracy improves as the number of background tokens decreases. Each point represents the average accuracy over 10 steps.
  • Figure 5: (a, b) Visualization of attention maps. (a): The attention map from the first generated answer token misses some target regions and is noisy. (b): Attention maps for every input question token accurately localize the target regions associated with each token. (c): Relative attention to the bounding-box areas across layers for Qwen2.5-VL and InternVL3. (d): Accuracy of different aggregation methods; Spatial Aggregate is the best strategy.
  • ...and 7 more figures