ImageRAG: Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG

Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Zian Guan, Bin Chen, Yuhao Wang, Xu Jia, Yuxiang Cai, Yongheng Shang, Jianwei Yin

TL;DR

The paper tackles the difficulty of applying Remote Sensing Multimodal LLMs to Ultra High Resolution (UHR) remote sensing imagery (RSI), where resizing the image loses tiny targets and keeping the original size exceeds the model's token limit, preventing full-image reasoning. It introduces ImageRAG, a training-free Retrieval-Augmented Generation framework that splits the task into a retrieval stage (selecting relevant visual cues from a UHR image using patch division, question analysis, and multi-modal retrieval) and a generation stage (a visual-cue-aware MLLM trained with Zoom4K and VQA10K data). Key contributions include a detailed patch-division strategy, a two-path retrieval mechanism (fast and slow) with dual vector databases (LRSD and CRSD), and a fine-tuned MLLM capable of leveraging visual cues. Experiments on MME-RealWorld-RS and MME-RealWorld-Lite-RS show notable improvements on Regular VQA and Inferring VQA tasks, as well as robust visual-cue retrieval. The paper also provides a cookbook for adapting ImageRAG to other image modalities and domains, highlighting its potential to enable scalable, low-training RS analysis of very large imagery such as $100{,}000 \times 100{,}000$ pixel scenes.

Abstract

Ultra High Resolution (UHR) remote sensing imagery (RSI) (e.g., 100,000 $\times$ 100,000 pixels or more) poses a significant challenge for current Remote Sensing Multimodal Large Language Models (RSMLLMs). If one resizes the UHR image to the standard input image size, the extensive spatial and contextual information that UHR images contain is lost. Otherwise, the original size of these images often exceeds the token limits of standard RSMLLMs, making it difficult to process the entire image and capture long-range dependencies to answer the query based on the abundant visual context. In this paper, we introduce ImageRAG for RS, a training-free framework to address the complexities of analyzing UHR remote sensing imagery. By transforming the UHR remote sensing image analysis task into a long-context selection task over the image, we design an innovative image contextual retrieval mechanism based on the Retrieval-Augmented Generation (RAG) technique, denoted as ImageRAG. ImageRAG's core innovation lies in its ability to selectively retrieve and focus on the portions of the UHR image most relevant to a given query, and to use them as visual context. A fast path and a slow path are proposed in this framework to handle this task efficiently and effectively. ImageRAG allows RSMLLMs to manage extensive context and spatial information from UHR RSI, ensuring the analysis is both accurate and efficient. The codebase will be released at https://github.com/om-ai-lab/ImageRAG
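The selection step described in the abstract (divide the image into patches, then keep the patches most relevant to the query) can be illustrated with a minimal sketch. This is not the authors' implementation: the fixed-size grid, the cosine-similarity scoring, and the names `grid_patches` and `top_k_patches` are assumptions made for illustration, and the embeddings are stand-in vectors rather than outputs of a real vision-language encoder.

```python
import numpy as np


def grid_patches(height, width, patch):
    """Tile an image of size height x width with square patches.

    Returns (top, left, bottom, right) boxes in row-major order.
    Patches on the right and bottom edges are clipped to the image.
    Hypothetical helper: the paper's patch-division strategy is more detailed.
    """
    boxes = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            bottom = min(top + patch, height)
            right = min(left + patch, width)
            boxes.append((top, left, bottom, right))
    return boxes


def top_k_patches(query_vec, patch_vecs, k):
    """Rank patches by cosine similarity to the query embedding.

    query_vec: shape (d,), patch_vecs: shape (n, d).
    Returns the indices of the k best patches and their scores.
    Assumes text and patch embeddings live in a shared space.
    """
    q = query_vec / np.linalg.norm(query_vec)
    p = patch_vecs / np.linalg.norm(patch_vecs, axis=1, keepdims=True)
    scores = p @ q
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()


# Toy usage: a 1200 x 1000 image cut into 500-pixel patches,
# with hand-written 2-D vectors standing in for real embeddings.
boxes = grid_patches(1200, 1000, 500)
patch_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([1.0, 0.0])
best, best_scores = top_k_patches(query_vec, patch_vecs, k=2)
```

Only the selected patches would then be passed to the MLLM as visual cues alongside the question, which keeps the input within the model's token limit regardless of the original image size.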

Paper Structure

This paper contains 60 sections, 18 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: An example of a challenging VQA task that requires analyzing small targets in a high-resolution image. Models such as GeoChat, InternVL2.5, and $V^{*}$ failed to answer. InternVL2.5 with the aid of ImageRAG and InternVL2.5 with a human-provided visual cue can answer the question correctly.
  • Figure 2: The performance of MLLMs in the MME-RealWorld-RS benchmark's remote sensing subset. The image resolution specified for each model's input is listed at the end. “DR” represents the Dynamic Resolution technique, with 6 or 12 indicating the maximum number of tiles obtained through Dynamic Resolution. In general, model performance tends to improve with increased input image resolution (DR can be seen as an enhancement of input image resolution).
  • Figure 3: Visualization of subtasks from the MME-RealWorld-RS dataset and statistics of the images. Image examples are taken from the appendix of the MME-RealWorld paper.
  • Figure 4: Distribution of ROI area ratios (ROI Area / Image Size)
  • Figure 5: Key Phrases Word Cloud
  • ...and 6 more figures