MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Fan Yang, Kaihao Zhang

TL;DR

High-resolution image understanding remains challenging for multimodal LLMs due to object fragmentation across crops and sensitivity to crop resolution. We introduce MRD, a training-free framework that fuses multi-resolution semantic similarity with an open-vocabulary detector to localize targets globally and accurately. The approach consists of Multi-resolution Semantic Fusion and Open-vocabulary Detector Enhancement, which together correct semantic biases and provide robust localization that guides efficient crop retrieval. Experiments on V* and HR-Bench across multiple MLLMs show MRD delivering state-of-the-art gains, especially for single-object tasks, and demonstrate strong generalization and practical impact for high-resolution perception in multimodal systems.

Abstract

Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to localize target objects directly at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
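
The multi-resolution semantic fusion step described above can be illustrated with a minimal sketch. It assumes that each crop resolution yields a 2D grid of crop-to-query similarity scores, that coarser grids divide evenly into the finest one, and that the maps are combined by a plain average. The function names `upsample_to` and `fuse_multires` and the averaging rule are illustrative assumptions, not the authors' implementation; the abstract does not specify the exact fusion rule.

```python
import numpy as np

def upsample_to(grid, target_shape):
    """Nearest-neighbour upsample of a coarse crop grid to a finer grid.

    Assumes the target shape is an integer multiple of the grid shape.
    """
    ry = target_shape[0] // grid.shape[0]
    rx = target_shape[1] // grid.shape[1]
    if grid.shape[0] * ry != target_shape[0] or grid.shape[1] * rx != target_shape[1]:
        raise ValueError("target shape must be an integer multiple of the grid shape")
    return np.repeat(np.repeat(grid, ry, axis=0), rx, axis=1)

def fuse_multires(sim_maps):
    """Average similarity maps computed at several crop resolutions.

    Every map is first brought to the finest grid, so a score from a large
    crop (which may contain a whole object) is shared by the small crops
    that the object would otherwise be split across.
    Averaging is an assumption made for this sketch.
    """
    target = max((m.shape for m in sim_maps), key=lambda s: s[0] * s[1])
    stacked = np.stack([upsample_to(m, target) for m in sim_maps])
    return stacked.mean(axis=0)

# Hypothetical scores: one row of 2 large crops and one row of 4 small crops.
coarse = np.array([[0.2, 0.8]])
fine = np.array([[0.1, 0.3, 0.9, 0.7]])
fused = fuse_multires([coarse, fine])
```

In this toy case the two small crops under the high-scoring large crop both receive a boost, which is the intended effect when an object straddles a crop boundary at the finer resolution.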

Paper Structure

This paper contains 26 sections, 11 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of the proposed Multi-resolution Retrieval-Detection framework, which uses RAG and OVD to obtain a semantic similarity map and a detection confidence map, respectively. Integrating the two localizes the target objects more accurately.
  • Figure 2: Setting the resolution of image crops to 112 causes complete objects to be split across different regions, which disrupts the semantic information of the target objects.
  • Figure 3: The effect of the resolution of retrieved image crops on model performance. Attribute and Spatial represent the attribute recognition and spatial reasoning in $V^{*}$ Bench.
  • Figure 4: Details of our proposed $\textit{MRD}$. First, we use VisRAG with image crops at different resolutions to obtain a multi-resolution semantic similarity map. We then employ an open-set object detection model, LLMDet, to localize the target objects extracted from the query within the high-resolution image using a sliding-window approach, yielding a global detection confidence map. Finally, the multi-resolution semantic similarity map is linearly fused with the detection confidence map, and the fused scores guide the subsequent search to select image crops containing the target objects.
  • Figure 5: Visualization of the Effects of Different Modules in MRD. Upper: Visualization of the Effects of the Multi-resolution Semantic Fusion Method. Lower: Visualization of the Effects of the Multi-resolution Semantic Fusion Method
  • ...and 6 more figures
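
The pipeline in the Figure 4 caption (sliding-window detection, then linear fusion with the semantic similarity map) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: `detect` is a hypothetical callable standing in for the open-vocabulary detector and is assumed to return boxes in global pixel coordinates; the fusion weight of 0.5 is an arbitrary placeholder; and a simple top-k pick stands in for the search procedure that the fused scores guide.

```python
import numpy as np

def detection_confidence_map(image_hw, cell, window, stride, detect):
    """Run `detect` over sliding windows and rasterise boxes onto the crop grid.

    `detect(y0, x0, y1, x1)` is a hypothetical stand-in for the detector.
    It returns (bx0, by0, bx1, by1, score) tuples in global pixel coordinates.
    Each grid cell keeps the highest score of any box that overlaps it.
    """
    height, width = image_hw
    conf = np.zeros((height // cell, width // cell))
    for y0 in range(0, max(height - window, 0) + 1, stride):
        for x0 in range(0, max(width - window, 0) + 1, stride):
            for bx0, by0, bx1, by1, score in detect(y0, x0, y0 + window, x0 + window):
                r0, r1 = int(by0) // cell, int(np.ceil(by1 / cell))
                c0, c1 = int(bx0) // cell, int(np.ceil(bx1 / cell))
                region = conf[r0:r1, c0:c1]  # view into conf, updated in place
                np.maximum(region, score, out=region)
    return conf

def fuse_and_select(sim_map, det_map, weight=0.5, k=3):
    """Linearly fuse the two maps and return the k highest-scoring cells.

    `weight` and `k` are illustrative values, not taken from the paper.
    """
    fused = (1.0 - weight) * sim_map + weight * det_map
    order = np.argsort(fused, axis=None)[::-1][:k]
    cells = [tuple(int(i) for i in np.unravel_index(f, fused.shape)) for f in order]
    return fused, cells

# Toy example: a 4x4 image, 2x2-pixel cells, one window covering the image,
# and a fake detector that reports one box in the top-right cell.
def fake_detect(y0, x0, y1, x1):
    return [(2, 0, 4, 2, 0.9)]

det_map = detection_confidence_map((4, 4), cell=2, window=4, stride=4, detect=fake_detect)
sim_map = np.array([[0.5, 0.4], [0.1, 0.2]])
fused_map, top_cells = fuse_and_select(sim_map, det_map, weight=0.5, k=1)
```

Here the semantic map alone would rank the top-left cell first, while the fused map ranks the top-right cell first because the detector supports it. That is the kind of correction the detection confidence map is meant to provide.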