SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts

Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang, Weipeng Zhang, Xu Na, Zhuoran Duan, Bo Yang

TL;DR

SkyMoE is a Mixture-of-Experts vision-language model tailored for remote sensing (RS). It features an adaptive router that dispatches inputs to specialized experts according to task and granularity, and a context-disentangled augmentation that balances local and global feature learning. The model is trained in two stages (foundation multimodal understanding, followed by MoE specialization) and is evaluated on a new RS benchmark, MGRS-Bench, that spans multiple granularities. Across 21 public datasets and five RS tasks, SkyMoE achieves state-of-the-art or competitive results, and ablations confirm that expert routing and granularity-focused augmentation yield complementary gains. The result is a scalable, interpretable framework for multi-scale geospatial interpretation.

Abstract

The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.
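The abstract does not spell out how the adaptive router selects experts. As a rough illustration only, the sketch below shows generic top-k softmax gating, a common MoE routing scheme; it is an assumption and not SkyMoE's actual router. The names `route_top_k` and `moe_forward`, the choice `k=2`, and the toy experts are all hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    # Pick the k highest-probability experts and renormalize
    # their weights so that they sum to 1.
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    # Combine the outputs of the selected experts as a weighted sum.
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts standing in for specialized FFN blocks (hypothetical).
toy_experts = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
]
selected = route_top_k([0.0, 2.0, 1.0, -1.0], k=2)
output = moe_forward([1.0, 2.0], toy_experts, [0.0, 0.0], k=2)
```

In a real model the router logits would be produced by a learned layer from the token or instruction representation; here they are passed in directly to keep the example self-contained.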

Paper Structure

This paper contains 16 sections, 8 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Selective masking of objects in images reveals minimal variation in model-provided counts, indicating a reliance on background context over precise enumeration.
  • Figure 2: Overall performance comparison between SkyMoE and eight state-of-the-art models across 21 datasets spanning five remote sensing interpretation tasks. The radar chart shows that SkyMoE achieves competitive performance on most benchmarks.
  • Figure 3: Training framework and strategy. SkyMoE adopts the standard vision-language framework composed of an image encoder, a visual adaptor, and a decoder-only LLM. Training proceeds in two stages. Stage I: initial LLM pretraining establishes multimodal understanding without MoE layers. Stage II: MoE specialization, in which each expert is initialized from cloned FFN weights and then fine-tuned.
  • Figure 4: Context-Disentangled Data Augmentation.
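
The Stage II initialization described in the Figure 3 caption (experts created by cloning FFN weights) can be sketched as follows. This is a minimal illustration under stated assumptions: the weight layout, the name `init_experts_from_ffn`, and the expert count of 4 are hypothetical and not taken from the paper.

```python
import copy

def init_experts_from_ffn(ffn_weights, num_experts):
    # Each expert starts as an independent deep copy of the dense FFN,
    # so later updates to one expert do not affect the others.
    return [copy.deepcopy(ffn_weights) for _ in range(num_experts)]

# Toy dense FFN weights from Stage I (hypothetical shapes and values).
dense_ffn = {
    "w_in": [[0.1, 0.2], [0.3, 0.4]],
    "w_out": [[0.5], [0.6]],
}

# Assumed expert count, chosen only for illustration.
experts = init_experts_from_ffn(dense_ffn, num_experts=4)

# Simulate a fine-tuning update to one expert; the other experts and
# the original dense FFN keep their Stage I values.
experts[0]["w_in"][0][0] += 1.0
```

Starting every expert from the same pretrained FFN means Stage II begins from the Stage I solution, and the experts diverge only through subsequent fine-tuning.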