RS-MoE: A Vision-Language Model with Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering

Hui Lin, Danfeng Hong, Shuhang Ge, Chuyao Luo, Kai Jiang, Hao Jin, Congcong Wen

TL;DR

RS-MoE introduces a novel Mixture of Experts (MoE) Vision-Language Model tailored for remote sensing image captioning (RSIC). It deploys an Instruction Router to generate task-specific prompts for multiple lightweight LLM experts, enabling theme-, object-, and relationship-focused captioning, while a two-stage, LoRA-enhanced training strategy addresses the performance degradation caused by sparsity. The approach achieves state-of-the-art results on five RSIC benchmarks and generalizes well to RSVQA datasets, with RS-MoE-1B matching the performance of much larger models at a fraction of the cost. This work advances practical remote sensing interpretation by delivering detailed, context-rich captions and robust multimodal reasoning in resource-constrained settings.

Abstract

Remote Sensing Image Captioning (RSIC) presents unique challenges and plays a critical role in many applications. Traditional RSIC methods often struggle to produce rich and diverse descriptions. Recently, with advances in Vision-Language Models (VLMs), efforts have emerged to integrate these models into the remote sensing domain and to introduce descriptive datasets specifically designed to enhance VLM training. This paper proposes RS-MoE, the first Mixture of Experts (MoE) based VLM customized for the remote sensing domain. Unlike traditional MoE models, the core of RS-MoE is the MoE Block, which incorporates a novel Instruction Router and multiple lightweight Large Language Models (LLMs) as expert models. The Instruction Router generates a specific prompt tailored to each corresponding LLM, guiding each one to focus on a distinct aspect of the RSIC task. This design allows each expert LLM to concentrate on a specific subset of the task, which enhances the specificity and accuracy of the generated captions, and it also improves the scalability of the model by facilitating parallel processing of sub-tasks. Additionally, we present a two-stage training strategy for tuning our RS-MoE model to prevent performance degradation due to sparsity. We fine-tuned our model on the RSICap dataset using our proposed training strategy. Experimental results on the RSICap dataset, along with evaluations on other traditional datasets where no additional fine-tuning was applied, demonstrate that our model achieves state-of-the-art performance in generating precise and contextually relevant captions. Notably, our RS-MoE-1B variant achieves performance comparable to 13B VLMs, demonstrating the efficiency of our model design. Moreover, our model demonstrates promising generalization capabilities by consistently achieving state-of-the-art performance on the Remote Sensing Visual Question Answering (RSVQA) task.
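
The dispatch pattern described in the abstract (a router that issues one prompt per expert, with the experts working on their sub-tasks in parallel) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The three aspect names follow the paper's description of theme, objects, and relationships, but the prompt templates, the function names (`instruction_router`, `make_stub_expert`, `generate_caption`), the stub experts, and the concatenation of expert outputs are all assumptions made for illustration. In the actual model the router is learned and operates on image features rather than on text templates.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Aspect names follow the paper's description; everything else is hypothetical.
ASPECTS = ("theme", "objects", "relationships")

# Hypothetical fixed templates standing in for the learned Instruction Router.
PROMPT_TEMPLATES: Dict[str, str] = {
    "theme": "Summarize the overall theme of this scene: {scene}",
    "objects": "List the specific objects visible in this scene: {scene}",
    "relationships": "Describe how the objects relate to each other: {scene}",
}

# An expert maps a prompt to a text fragment (a lightweight LLM in the paper).
Expert = Callable[[str], str]


def instruction_router(scene: str) -> Dict[str, str]:
    """Return one aspect-specific prompt per expert."""
    return {a: PROMPT_TEMPLATES[a].format(scene=scene) for a in ASPECTS}


def make_stub_expert(aspect: str) -> Expert:
    """Build a placeholder expert that echoes its prompt with an aspect tag."""
    def expert(prompt: str) -> str:
        return f"[{aspect}] {prompt}"
    return expert


def generate_caption(scene: str, experts: Dict[str, Expert]) -> str:
    """Send each prompt to its expert in parallel and join the fragments.

    Joining by concatenation is an assumption; the abstract does not say how
    the expert outputs are merged.
    """
    prompts = instruction_router(scene)
    with ThreadPoolExecutor(max_workers=len(ASPECTS)) as pool:
        futures = {a: pool.submit(experts[a], prompts[a]) for a in ASPECTS}
        parts = [futures[a].result() for a in ASPECTS]
    return " ".join(parts)


if __name__ == "__main__":
    stub_experts = {a: make_stub_expert(a) for a in ASPECTS}
    print(generate_caption("an airport with two runways", stub_experts))
```

Because each expert receives only its own prompt, the three calls are independent of one another, which is the property that makes the parallel processing mentioned in the abstract possible.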

Paper Structure

This paper contains 38 sections, 7 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Overview of the proposed RS-MoE model, which consists of four key components: the Image Encoder, the VLM Encoder, the LLM Block, and the MoE Block. The MoE Block comprises an Instruction Router that dynamically generates task-specific prompts and three lightweight LLMs, each of which focuses on a different aspect of the captioning task. In the generated captions, shown in the top right corner of the figure, distinct colors represent each aspect: orange for the overall theme, purple for specific objects, and green for relationships between objects. RS-MoE is trained using a novel two-stage training strategy specifically designed for remote sensing image captioning. In Stage I (a), the VLM Encoder and the LLM Block are fine-tuned to initialize model weights suited to the RSIC task. In Stage II (b), the MoE Block is fine-tuned to produce more detailed captions for RSIC tasks.
  • Figure 2: Qualitative results of image captions generated by our RS-MoE model and the SOTA model for three randomly selected remote sensing images from the RSEval dataset. In captions generated by our model, different colors indicate distinct aspects of the image: orange for the overall theme, purple for objects within the scene, and green for spatial relationships between objects.
  • Figure 3: Qualitative results of image captions generated by RSGPT and our RS-MoE model for two randomly selected remote sensing images from the UCM-Captions dataset. In captions generated by our model, different colors indicate distinct aspects of the image: orange for the overall theme, purple for objects within the scene, and green for spatial relationships between objects.
  • Figure 4: Qualitative results of image captions generated by RSGPT and our RS-MoE model for two randomly selected remote sensing images from the Sydney-Captions dataset. In captions generated by our model, different colors indicate distinct aspects of the image: orange for the overall theme, purple for objects within the scene, and green for spatial relationships between objects.
  • Figure 5: Qualitative results of image captions generated by RSGPT and our RS-MoE model for two randomly selected remote sensing images from the RSICD dataset. In captions generated by our model, different colors indicate distinct aspects of the image: orange for the overall theme, purple for objects within the scene, and green for spatial relationships between objects.
  • ...and 1 more figure
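
The two-stage schedule described in the Figure 1 caption can be summarized as a per-stage table of which components receive gradient updates. The sketch below is a hedged illustration only. The four component names come from the caption, while the identifiers (`COMPONENTS`, `STAGE_TRAINABLE`, `trainable_flags`) are hypothetical. It also assumes that any component not named for a stage is frozen during that stage, including the Image Encoder throughout, which the caption does not state explicitly. LoRA adapters, mentioned in the TL;DR, are not modeled here.

```python
from typing import Dict, Set, Tuple

# Component names taken from the Figure 1 caption.
COMPONENTS: Tuple[str, ...] = (
    "image_encoder",
    "vlm_encoder",
    "llm_block",
    "moe_block",
)

# Which components are fine-tuned in each stage, per the caption.
# Assumption: every component not listed for a stage is kept frozen.
STAGE_TRAINABLE: Dict[str, Set[str]] = {
    "stage_1": {"vlm_encoder", "llm_block"},  # initialize weights for RSIC
    "stage_2": {"moe_block"},                 # refine detailed captioning
}


def trainable_flags(stage: str) -> Dict[str, bool]:
    """Map every component to True (fine-tuned) or False (frozen) for a stage."""
    if stage not in STAGE_TRAINABLE:
        raise ValueError(f"unknown stage: {stage!r}")
    active = STAGE_TRAINABLE[stage]
    return {name: name in active for name in COMPONENTS}


if __name__ == "__main__":
    for stage in ("stage_1", "stage_2"):
        print(stage, trainable_flags(stage))
```

In a deep learning framework, flags like these would typically be applied by enabling or disabling gradient tracking on each component's parameters before the corresponding stage begins.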