Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG

Wenbin Wang, Yongcheng Jing, Liang Ding, Yingjie Wang, Li Shen, Yong Luo, Bo Du, Dacheng Tao

TL;DR

This work tackles the challenge of high-resolution image perception in multimodal LLMs by reframing it as a long-context problem amenable to retrieval-augmented processing. The authors introduce RAP, a training-free framework that retrieves and fuses relevant image crops while preserving spatial context through a Spatial-Awareness Layout and adaptively selects the number of crops with Retrieved-Exploration Search (RE-Search). Through extensive experiments on HR benchmarks and a general multimodal suite, RAP demonstrates substantial improvements across model sizes and tasks, especially for spatially demanding perception, while maintaining efficiency advantages over prior HR methods. The approach offers practical impact for scaling HR perception in MLLMs, enabling finer-grained understanding without heavy tokenization or retraining, and points to future work on more aggressive token-compression strategies.

Abstract

High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental idea in HR perception: enhancing the long-context capability of MLLMs, motivated by recent advances in long-context techniques such as retrieval-augmented generation (RAG) for general LLMs. To this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on $V^*$ Bench and 19% on HR-Bench.
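To make the "retrieve and fuse crops while preserving spatial context" idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the image has already been split into a grid of crops and that each crop has a query-similarity score. The function names `select_top_k` and `compact_layout` are hypothetical, and the layout rule used here (drop every grid row and column that contains no selected crop) is an assumption about one way to keep relative positions intact.

```python
from typing import List, Optional, Sequence, Tuple

Cell = Tuple[int, int]  # (row, col) position of a crop in the original grid


def select_top_k(scores: Sequence[Sequence[float]], k: int) -> List[Cell]:
    """Return the grid positions of the k highest-scoring crops, in row-major order."""
    cells = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    # Highest score first; ties are broken by grid position for determinism.
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    return sorted((r, c) for _, r, c in cells[:k])


def compact_layout(selected: Sequence[Cell]) -> List[List[Optional[Cell]]]:
    """Build a smaller grid that keeps the relative order of the selected crops.

    Assumption (illustrative only): rows and columns with no selected crop are
    dropped; a kept cell with no selected crop becomes None (a blank tile).
    """
    rows = sorted({r for r, _ in selected})
    cols = sorted({c for _, c in selected})
    chosen = set(selected)
    return [[(r, c) if (r, c) in chosen else None for c in cols] for r in rows]


if __name__ == "__main__":
    # Toy 3x3 grid of query-to-crop similarity scores (made-up numbers).
    scores = [
        [0.1, 0.9, 0.2],
        [0.3, 0.1, 0.1],
        [0.2, 0.8, 0.7],
    ]
    selected = select_top_k(scores, k=3)
    for row in compact_layout(selected):
        print(row)
```

In this toy grid the middle row and the left column contain no selected crop, so they are dropped, while the crop from the top row still sits above the crops from the bottom row. The fused image would then be assembled by pasting the actual crops into this smaller grid.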

Paper Structure

This paper contains 30 sections, 6 equations, 8 figures, 8 tables, 2 algorithms.

Figures (8)

  • Figure 1: (a) Overview of the proposed Retrieval-Augmented Perception (RAP) framework, which divides the HR images into image crops for retrieval, followed by layout reconstruction to retain the spatial information; (b) Performance comparison of MLLMs across various model sizes, demonstrating consistent improvements with our RAP on HR-Bench.
  • Figure 2: The effect of the number of retrieved image crops on model performance. FSP and FCP represent the fine-grained single-instance perception tasks and fine-grained cross-instance perception tasks, respectively.
  • Figure 3: Detailed illustration of our proposed RAP with a running example. We first divide the HR image into multiple image crops and compute the similarity score $s$ between the query and the image crops to retrieve the key image crops. We design a simple and efficient method called Spatial-Awareness Layout to maintain the relative positional relationships of the image crops. Since the optimal number of image crops is highly sensitive to the task type, we propose RE-Search, which identifies the optimal $K$ based on the model's confidence scores and retrieval scores.
  • Figure 4: Analyzing the distribution for selecting $K$ using our RAP. (a) The distribution of $K$ selected by RAP, where "All" denotes the total number of image crops in the original image. (b) The distribution of $K$ corresponding to different task types.
  • Figure 5: Performance vs. RE-Search steps on HR-Bench 8K. (a) Fine-grained Single-instance Perception (FSP); (b) Fine-grained Cross-instance Perception (FCP); (c) Overall Performance.
  • ...and 3 more figures
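
The Figure 3 caption says RE-Search picks $K$ from the model's confidence scores and the retrieval scores, but this page does not give the search procedure. The sketch below is therefore only an illustration of combining the two signals, not the paper's algorithm. The function `choose_k`, the `confidence_fn` callback, the candidate list, and the linear weighting with `weight` are all assumptions introduced here.

```python
from typing import Callable, Sequence


def choose_k(
    retrieval_scores: Sequence[float],
    candidate_ks: Sequence[int],
    confidence_fn: Callable[[int], float],
    weight: float = 0.5,
) -> int:
    """Pick the candidate K with the best combined score.

    Assumptions (illustrative only):
      * retrieval quality for K is the mean score of the top-K crops;
      * confidence_fn(K) returns the model's confidence, in [0, 1], when it
        is shown the top-K crops (a hypothetical callback);
      * the two signals are mixed linearly with `weight`.
    """
    ranked = sorted(retrieval_scores, reverse=True)
    best_k, best_value = None, float("-inf")
    for k in candidate_ks:
        if not 1 <= k <= len(ranked):
            continue  # skip candidates that do not fit the number of crops
        mean_retrieval = sum(ranked[:k]) / k
        value = (1.0 - weight) * mean_retrieval + weight * confidence_fn(k)
        if value > best_value:
            best_k, best_value = k, value
    if best_k is None:
        raise ValueError("no valid candidate K")
    return best_k


if __name__ == "__main__":
    # Made-up retrieval scores and a stand-in confidence table.
    scores = [0.9, 0.8, 0.2, 0.1]
    fake_confidence = {1: 0.3, 2: 0.8, 4: 0.6}
    print(choose_k(scores, [1, 2, 4], lambda k: fake_confidence[k]))
```

With these toy numbers, a single crop scores well on retrieval but poorly on confidence, and using all four crops dilutes the retrieval score, so the intermediate value of $K$ is selected. In a real system the callback would query the MLLM, so each call costs one forward pass.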