RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model

Congcong Wen, Yiting Lin, Xiaokang Qu, Nan Li, Yong Liao, Hui Lin, Xiang Li

TL;DR

The paper addresses a gap in remote sensing vision-language models (VLMs), which struggle to incorporate external knowledge for complex reasoning. It introduces the Remote Sensing World Knowledge (RSWK) dataset, which pairs high-resolution imagery with both remote sensing domain knowledge and world knowledge for 14,141 landmarks across 175 countries, and proposes RS-RAG, a Retrieval-Augmented Generation framework. RS-RAG builds a Multi-Modal Knowledge Vector Database using CLIP-based encodings, then performs knowledge retrieval, fusion, and knowledge-conditioned prompting to guide a VLM. Across image captioning, image classification, and visual question answering, RS-RAG significantly outperforms state-of-the-art baselines, demonstrating the value of integrating external knowledge for more accurate, context-rich, and interpretable remote sensing VLM outputs.

Abstract

Recent vision-language models (VLMs) have demonstrated impressive capabilities across a variety of tasks in the natural image domain. Motivated by these advancements, the remote sensing community has begun to adopt VLMs for remote sensing vision-language tasks, including scene understanding, image captioning, and visual question answering. However, existing remote sensing VLMs typically rely on closed-set scene understanding and focus on generic scene descriptions, and they lack the ability to incorporate external knowledge. This limitation hinders their capacity for semantic reasoning over complex or context-dependent queries that involve domain-specific or world knowledge. To address these challenges, we first introduce a multimodal Remote Sensing World Knowledge (RSWK) dataset, which comprises high-resolution satellite imagery and detailed textual descriptions for 14,141 well-known landmarks from 175 countries, integrating both remote sensing domain knowledge and broader world knowledge. Building upon this dataset, we propose a novel Remote Sensing Retrieval-Augmented Generation (RS-RAG) framework, which consists of two key components. The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and associated textual knowledge into a unified vector space. The Knowledge Retrieval and Response Generation module retrieves and re-ranks relevant knowledge based on image and/or text queries, and incorporates the retrieved content into a knowledge-augmented prompt to guide the VLM in producing contextually grounded responses. We validate the effectiveness of our approach on three representative vision-language tasks (image captioning, image classification, and visual question answering), where RS-RAG significantly outperforms state-of-the-art baselines.
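
The idea of encoding imagery and text into one vector space and retrieving by similarity can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `KnowledgeVectorDB`, its methods, and the 3-dimensional hand-written vectors are assumptions made for illustration. A real system would obtain the embeddings from CLIP-style image and text encoders and would likely use an approximate nearest-neighbour index instead of a dense matrix product.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


class KnowledgeVectorDB:
    """Toy in-memory store (hypothetical); image and text entries share one space."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = np.empty((0, dim))  # one unit-length row per stored entry
        self.records = []                  # metadata aligned with the rows

    def add(self, embedding, record: dict) -> None:
        vec = l2_normalize(np.asarray(embedding, dtype=float).reshape(1, self.dim))
        self.vectors = np.vstack([self.vectors, vec])
        self.records.append(record)

    def search(self, query, k: int = 3):
        """Return the k entries with the highest cosine similarity to the query."""
        q = l2_normalize(np.asarray(query, dtype=float).reshape(1, self.dim))[0]
        scores = self.vectors @ q
        order = np.argsort(-scores)[:k]
        return [(self.records[i], float(scores[i])) for i in order]


# Hand-written toy vectors stand in for encoder outputs (assumption).
db = KnowledgeVectorDB(dim=3)
db.add([1.0, 0.0, 0.0], {"landmark": "Sydney Opera House", "modality": "image"})
db.add([0.9, 0.1, 0.0], {"landmark": "Sydney Opera House", "modality": "text"})
db.add([0.0, 1.0, 0.0], {"landmark": "Great Seto Bridge", "modality": "image"})

hits = db.search([1.0, 0.05, 0.0], k=2)
for record, score in hits:
    print(record["landmark"], record["modality"], round(score, 3))
```

Because every stored vector and every query is normalized to unit length, the same `search` call serves image queries and text queries alike, which is the practical benefit of a unified vector space.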

Paper Structure

This paper contains 18 sections, 11 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The construction process of the Remote Sensing World Knowledge (RSWK) dataset begins with collecting landmark data from around the world, followed by extracting geographic information to pinpoint precise location coordinates. Using these coordinates, remote sensing images are acquired, which are further standardized through image processing techniques. Corresponding remote sensing expert knowledge, such as surface temperature, climate, atmospheric conditions, and spectral coefficients, is also included. Additionally, world knowledge is retrieved from online resources, providing detailed background information about the landmarks, including historical context, cultural significance, and major events. This combined information is structured into organized attributes to align image and text data, forming a final multimodal dataset. The resulting RSWK dataset integrates high-resolution images with extensive remote sensing and world knowledge, enabling advanced semantic understanding in remote sensing applications.
  • Figure 2: Overview of the RSWK dataset. (a) Global distribution of landmarks used in the dataset, with color indicating the number of landmarks per country. (b) Statistical summaries of landmark counts across the top 100 countries (left) and the top 15 most frequent landmark categories (right). (c) A specific example from the RSWK dataset, showcasing the Sydney Opera House, including its satellite imagery, remote sensing domain knowledge, and structured world knowledge.
  • Figure 3: Overview of the proposed Remote Sensing Retrieval-Augmented Generation (RS-RAG) model. It consists of two main processes: (a) The Multi-Modal Knowledge Vector Database Construction module encodes remote sensing imagery and domain/world knowledge into a unified vector space via image and text encoders, enabling efficient cross-modal retrieval. (b) The Knowledge Retrieval and Response Generation module retrieves top-k relevant knowledge based on image and/or textual queries, and re-ranks the results for better relevance. Retrieved knowledge is fused into the prompt through Knowledge-Conditioned Context Fusion, guiding the Vision-Language Model (VLM) to generate Knowledge-Grounded Responses. The RS-RAG model supports multiple downstream tasks such as Image Captioning, Image Classification, and Visual Question Answering, as demonstrated in the bottom section.
  • Figure 4: Qualitative results of image captioning on remote sensing imagery of the Great Seto Bridge. Text in red indicates the recognized landmark name; purple highlights retrieved world knowledge, such as historical, geographic, or cultural facts; and green denotes domain-specific knowledge, including spectral indices, land cover, and ALBEDO values.
  • Figure 5: Qualitative results of baseline models and our RS-RAG model on the image classification task using the RSWK dataset.
  • ...and 1 more figures
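
The retrieve, re-rank, and prompt-fusion flow described in the Figure 3 caption can be sketched as follows. The text here does not specify the paper's re-ranking rule or prompt template, so both are assumptions: `rerank` uses a simple weighted sum of an image-similarity score and a text-similarity score (the weight `alpha` is hypothetical), and `build_prompt` uses an invented template. The knowledge strings are placeholders, not dataset content.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    landmark: str
    knowledge: str
    image_score: float  # similarity between the query image and the stored entry
    text_score: float   # similarity between the query text and the stored entry


def rerank(candidates, alpha: float = 0.5, top_k: int = 2):
    """Order candidates by a weighted sum of both scores (assumed rule)."""
    def fused(c: Candidate) -> float:
        return alpha * c.image_score + (1 - alpha) * c.text_score
    return sorted(candidates, key=fused, reverse=True)[:top_k]


def build_prompt(question: str, retrieved) -> str:
    """Place retrieved knowledge ahead of the question (hypothetical template)."""
    lines = [
        "Use the retrieved knowledge below to answer the question about the image.",
        "",
        "Retrieved knowledge:",
    ]
    for i, c in enumerate(retrieved, start=1):
        lines.append(f"[{i}] {c.landmark}: {c.knowledge}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Scores and knowledge text are made-up placeholders.
candidates = [
    Candidate("Sydney Opera House", "(placeholder knowledge B)", 0.5, 0.7),
    Candidate("Unrelated landmark", "(placeholder knowledge C)", 0.2, 0.3),
    Candidate("Great Seto Bridge", "(placeholder knowledge A)", 0.9, 0.4),
]

top = rerank(candidates, alpha=0.5, top_k=2)
prompt = build_prompt("Which landmark is shown in this image?", top)
print(prompt)
```

The resulting string would be passed to the VLM together with the query image. Keeping re-ranking separate from prompt construction makes it straightforward to swap in a different scoring rule without touching the template.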