INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Yiwei Ma; Zhibin Wang; Xiaoshuai Sun; Weihuang Lin; Qiang Zhou; Jiayi Ji; Rongrong Ji

TL;DR

INF-LLaVA tackles high-resolution perception in multimodal LLMs with two modules: the Dual-perspective Cropping Module (DCM) and the Dual-perspective Enhancement Module (DEM). DCM crops images from both local and global perspectives to preserve fine details and broader context, while DEM fuses these dual features efficiently, avoiding prohibitive cross-attention at full resolution. Built on a CLIP-ViT-L/14 encoder and a LLaMA3-8B LLM, INF-LLaVA achieves state-of-the-art results across multiple vision-language benchmarks and demonstrates notable efficiency benefits. The work provides a practical path toward robust high-resolution perception in MLLMs and releases code and pretrained models for broader adoption.

Abstract

With advancements in data availability and computing resources, Multimodal Large Language Models (MLLMs) have showcased capabilities across various fields. However, the quadratic complexity of the vision encoder in MLLMs constrains the resolution of input images. Most current approaches mitigate this issue by cropping high-resolution images into smaller sub-images, which are then processed independently by the vision encoder. Despite capturing sufficient local details, these sub-images lack global context and fail to interact with one another. To address this limitation, we propose a novel MLLM, INF-LLaVA, designed for effective high-resolution image perception. INF-LLaVA incorporates two innovative components. First, we introduce a Dual-perspective Cropping Module (DCM), which ensures that each sub-image contains continuous details from a local perspective and comprehensive information from a global perspective. Second, we introduce a Dual-perspective Enhancement Module (DEM) to enable the mutual enhancement of global and local features, allowing INF-LLaVA to effectively process high-resolution images by simultaneously capturing detailed local information and comprehensive global context. Extensive ablation studies validate the effectiveness of these components, and experiments on a diverse set of benchmarks demonstrate that INF-LLaVA outperforms existing MLLMs. Code and pretrained model are available at https://github.com/WeihuangLin/INF-LLaVA.
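The two cropping perspectives named in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: the function names `local_crops` and `global_crops` are hypothetical, and the strided sampling used for the global perspective is an assumption, since the abstract only states that each sub-image should carry continuous local detail and comprehensive global information.

```python
from typing import List

Image = List[List[int]]  # toy single-channel image as nested lists


def local_crops(img: Image, k: int) -> List[Image]:
    """Split img into k*k contiguous tiles, in row-major order."""
    h, w = len(img), len(img[0])
    th, tw = h // k, w // k  # tile size; assumes h and w are divisible by k
    return [
        [row[c * tw:(c + 1) * tw] for row in img[r * th:(r + 1) * th]]
        for r in range(k)
        for c in range(k)
    ]


def global_crops(img: Image, k: int) -> List[Image]:
    """Split img into k*k strided sub-images (assumed interpretation).

    Each sub-image takes every k-th pixel, so it spans the whole image
    at reduced density instead of covering one contiguous region.
    """
    return [
        [row[c::k] for row in img[r::k]]
        for r in range(k)
        for c in range(k)
    ]


# 4x4 toy image whose pixel values are 0..15 in row-major order.
img = [[y * 4 + x for x in range(4)] for y in range(4)]
local = local_crops(img, 2)     # local[0] is the top-left 2x2 tile
global_ = global_crops(img, 2)  # global_[0] samples rows 0, 2 and columns 0, 2
```

In both cases every pixel lands in exactly one sub-image, so the sub-images stay small enough for the vision encoder while the two sets differ in what each sub-image can see: a contiguous region versus a sparse view of the full image.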

Paper Structure

This paper contains 29 sections, 27 equations, 8 figures, 6 tables.

Introduction
Related Work
Large Language Models (LLMs)
Multimodal Large Language Models (MLLMs)
High-resolution MLLMs
Preliminary
Methods
Overview
Dual-perspective Cropping Module
Local-perspective Cropping
Global-perspective Cropping
Dual-perspective Enhancement Module
Sub-features Combination
Global-Perspective Enhancement
Local-Perspective Enhancement
...and 14 more sections

Figures (8)

Figure 1: Comparison between existing high-resolution MLLMs and INF-LLaVA. LR and HR abbreviate low-resolution and high-resolution, respectively. Zoom in for optimal viewing.
Figure 2: Overview of the proposed INF-LLaVA framework. To address the limitations of processing high-resolution images directly with the pretrained CLIP-ViT encoder, the Dual-perspective Cropping Module (DCM) segments the high-resolution image into sub-images from both local and global perspectives. Each sub-image is then individually passed through the CLIP-ViT encoder to extract distinct visual features. These features are subsequently recombined based on 2D positional priors, resulting in a comprehensive set of high-resolution local and global features. The Dual-perspective Enhancement Module (DEM) is introduced to facilitate effective interaction between the local and global features. Next, an average pooling layer is applied to reduce the number of visual tokens, enhancing computational efficiency and speeding up both training and inference. Finally, the refined visual tokens are concatenated with the textual tokens of the instruction and fed into the LLM, which generates responses sequentially, token by token.
Figure 3: Illustration of the integration of local and global perspective sub-features using two-dimensional positional priors. This approach ensures a seamless combination of detailed local information with broader contextual insights, maintaining spatial coherence and enhancing the overall representation of the high-resolution image.
Figure 4: Illustration of the proposed Dual-perspective Enhancement Module (DEM), highlighting its innovative approach to efficiently integrating and enhancing local and global sub-features for superior image understanding.
Figure 5: Chat comparison using different image resolutions. Certain regions of the input high-resolution images are zoomed in for enhanced visualization.
...and 3 more figures
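
The Figure 2 caption mentions an average pooling layer that reduces the number of visual tokens before they reach the LLM. A minimal sketch of that idea follows. The function name `avg_pool_tokens`, the non-overlapping window, and the window size of 2 are assumptions made for illustration; the caption does not give the pooling configuration.

```python
from typing import List

Grid = List[List[List[float]]]  # [rows][cols][channels] grid of visual tokens


def avg_pool_tokens(grid: Grid, window: int) -> Grid:
    """Average non-overlapping window x window blocks of tokens."""
    rows, cols, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for r in range(0, rows - window + 1, window):
        out_row = []
        for c in range(0, cols - window + 1, window):
            # Collect the tokens that fall inside the current block.
            block = [grid[r + i][c + j] for i in range(window) for j in range(window)]
            # Average the block channel by channel into a single token.
            out_row.append([sum(tok[d] for tok in block) / len(block) for d in range(dim)])
        pooled.append(out_row)
    return pooled


# 4x4 grid of one-channel tokens with values 0.0..15.0 in row-major order.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
pooled = avg_pool_tokens(grid, 2)  # 16 tokens in, 4 tokens out
```

With a window of 2 the token count drops by a factor of 4, which shortens the visual part of the sequence passed to the LLM accordingly.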
