RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts

Xu Liu, Zhouhui Lian

TL;DR

RSUniVLM addresses multi-granularity remote-sensing vision-language understanding by unifying image-, region-, and pixel-level tasks within a text-only generation framework. It introduces Granularity-oriented MoE (G-MoE) with three experts and a training-free router, enabling specialized perception across granularities while keeping parameter count around 1B. A two-stage instruction-tuning regime combines RS-specific and general-domain data to align and specialize the model, achieving strong results across 6 tasks on 13 datasets, notably in visual grounding and multi-image change analysis. The work advances practical RS applications by delivering an end-to-end, multi-granularity RS VLM without task-specific heads, enabling versatile, scalable remote sensing reasoning and generation.

Abstract

Remote Sensing Vision-Language Models (RS VLMs) have made considerable progress in remote sensing (RS) image comprehension. While they perform well in multi-modal reasoning and multi-turn conversations, existing models lack pixel-level understanding and struggle with multi-image inputs. In this work, we propose RSUniVLM, a unified, end-to-end RS VLM designed for comprehensive vision understanding across multiple granularities, including image-level, region-level, and pixel-level tasks. RSUniVLM also performs effectively in multi-image analysis, such as change detection and change captioning. To enhance the model's ability to capture visual information at different levels without increasing model size, we design a novel architecture called Granularity-oriented Mixture of Experts, which keeps the model at about 1 billion parameters. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains, encompassing tasks such as object localization, visual question answering, and semantic segmentation. Extensive experiments demonstrate that the proposed RSUniVLM reaches state-of-the-art performance across various RS tasks. Code and model will be available at https://github.com/xuliu-cyber/RSUniVLM.

Paper Structure

This paper contains 37 sections, 1 equation, 6 figures, 12 tables.

Figures (6)

  • Figure 1: RSUniVLM is a unified remote sensing VLM with versatile capabilities across three levels of visual understanding: a) Image captioning and visual question answering at image-level; b) Visual grounding and referring expression generation at region-level; c) Semantic segmentation at pixel-level. Apart from tasks with single-image inputs, our RSUniVLM can also tackle multi-image comprehension tasks, such as change captioning and change detection. The radar chart demonstrates that our model is competitive with others on most datasets, and performs significantly better on VRSBench and DIOR-RSVG.
  • Figure 2: An overview of RSUniVLM. We adopt the classic LLaVA-based architecture, which consists of an image encoder, an MLP connector, and a large language model. We unify all tasks into text-only generation, enabling joint optimization of multiple tasks in an end-to-end manner. During stage-2 training, Granularity-oriented MoE is applied to the LLM to enhance the model's multi-level vision understanding.
  • Figure 3: Qualitative results of RSUniVLM across a variety of tasks, demonstrating its ability to handle tasks at multiple levels of visual granularity. Moreover, RSUniVLM performs well in change analysis involving multi-image input.
  • Figure 4: Qualitative results of visual grounding.
  • Figure 5: Qualitative results of semantic segmentation.
  • ...and 1 more figure
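
The TL;DR and the Figure 2 caption describe a Granularity-oriented MoE with three experts and a training-free router. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the router is a fixed lookup from task type to granularity, with one expert per granularity. The task-to-granularity table, the function names (`route`, `g_moe_layer`), and the toy scaling "experts" are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]

# ASSUMPTION: this task-to-granularity table is illustrative only. The task
# names come from the figure captions; the assignment of the change tasks
# to a granularity is a guess, not something the source states.
TASK_GRANULARITY: Dict[str, str] = {
    "image_captioning": "image",
    "visual_question_answering": "image",
    "change_captioning": "image",
    "visual_grounding": "region",
    "referring_expression_generation": "region",
    "semantic_segmentation": "pixel",
    "change_detection": "pixel",
}


def make_expert(scale: float) -> Callable[[Vector], Vector]:
    """Return a toy stand-in for an expert feed-forward network."""
    def expert(hidden: Vector) -> Vector:
        return [scale * h for h in hidden]
    return expert


# Three experts, one per granularity (hypothetical toy weights).
EXPERTS: Dict[str, Callable[[Vector], Vector]] = {
    "image": make_expert(1.0),
    "region": make_expert(2.0),
    "pixel": make_expert(3.0),
}


def route(task: str) -> str:
    """Training-free router: a fixed lookup with no learned gating weights."""
    if task not in TASK_GRANULARITY:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_GRANULARITY[task]


def g_moe_layer(hidden_states: List[Vector], task: str) -> List[Vector]:
    """Send every token of the sample through the expert chosen for its task."""
    expert = EXPERTS[route(task)]
    return [expert(token) for token in hidden_states]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [0.5, -1.0]]
    print(route("visual_grounding"))                      # region
    print(g_moe_layer(tokens, "semantic_segmentation"))   # [[3.0, 6.0], [1.5, -3.0]]
```

Because the expert is chosen by lookup rather than by a learned gate, this sketch has no router parameters to train and only one expert runs per sample, which is one plausible reading of how specialization could be added without growing the active parameter count.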