ImageRAG: Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG
Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Zian Guan, Bin Chen, Yuhao Wang, Xu Jia, Yuxiang Cai, Yongheng Shang, Jianwei Yin
TL;DR
The paper tackles the difficulty of applying Remote Sensing Multimodal LLMs to Ultra High Resolution (UHR) remote sensing imagery (RSI), where resizing the image loses tiny targets and keeping the original size exceeds token limits, preventing full-image reasoning. It introduces ImageRAG, a training-free Retrieval-Augmented Generation framework that splits the task into a retrieval stage (selecting relevant visual cues from a UHR image using patch division, question analysis, and multi-modal retrieval) and a generation stage (a visual-cue-aware MLLM trained with Zoom4K and VQA10K data). Key contributions include a detailed patch-division strategy, a two-path retrieval mechanism (fast and slow) with dual vector databases (LRSD and CRSD), and a fine-tuned MLLM capable of leveraging visual cues. Experiments on MME-RealWorld-RS and MME-RealWorld-Lite-RS show notable improvements on Regular VQA and Inferring VQA tasks, as well as robust visual-cue retrieval. The paper also provides a cookbook for adapting ImageRAG to other image modalities and domains, highlighting its potential to enable scalable, low-training RS analysis of very large imagery such as $100{,}000 \times 100{,}000$ pixel scenes.
Abstract
Ultra High Resolution (UHR) remote sensing imagery (RSI) (e.g. 100,000 $\times$ 100,000 pixels or more) poses a significant challenge for current Remote Sensing Multimodal Large Language Models (RSMLLMs). If a UHR image is resized to the standard input size, the extensive spatial and contextual information it contains is lost. If it is kept at its original size, it often exceeds the token limits of standard RSMLLMs, making it difficult to process the entire image and capture the long-range dependencies needed to answer a query from the abundant visual context. In this paper, we introduce ImageRAG for RS, a training-free framework that addresses the complexities of analyzing UHR remote sensing imagery. By recasting UHR remote sensing image analysis as a long-context selection task over the image, we design an image contextual retrieval mechanism based on the Retrieval-Augmented Generation (RAG) technique, denoted as ImageRAG. ImageRAG's core innovation lies in its ability to selectively retrieve and focus on the portions of the UHR image most relevant to a given query and to use them as visual context. A fast path and a slow path are proposed in this framework to handle this task efficiently and effectively. ImageRAG allows RSMLLMs to manage extensive context and spatial information from UHR RSI, ensuring the analysis is both accurate and efficient. The codebase will be released at https://github.com/om-ai-lab/ImageRAG
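To make the idea of selecting relevant portions of a UHR image concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain non-overlapping grid for patch division and a cosine-similarity ranking between a query embedding and per-patch embeddings; the paper's actual patch-division strategy and its fast and slow paths are not reproduced here. The function names (`divide_into_patches`, `cosine`, `top_k_patches`) are hypothetical, and the embeddings are assumed to come from some image-text encoder that is outside this sketch.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def divide_into_patches(width: int, height: int, patch: int) -> List[Box]:
    """Tile a width x height image into non-overlapping boxes.

    Boxes on the right and bottom edges are clipped to the image size.
    This uniform grid is an assumption made for illustration only.
    """
    boxes = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            right = min(left + patch, width)
            bottom = min(top + patch, height)
            boxes.append((left, top, right, bottom))
    return boxes


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_patches(
    query_vec: Sequence[float],
    patch_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k patches most similar to the query.

    Ties are broken by the lower patch index so the result is deterministic.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(patch_vecs)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]
```

In such a setup, only the boxes returned by `top_k_patches` would be cropped and passed to the model as visual cues alongside the query, which keeps the input within the token budget regardless of the size of the full scene.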
