RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation

Hanbo Bi, Yingchao Feng, Boyuan Tong, Mengyu Wang, Haichen Yu, Yongqiang Mao, Hao Chang, Wenhui Diao, Peijin Wang, Yue Yu, Hanyang Peng, Yehong Zhang, Kun Fu, Xian Sun

TL;DR

RingMoE is a 14.7B-parameter, four-modality remote sensing (RS) foundation model whose hierarchical MoE-based encoder combines modal-specific, collaborative, and shared experts to learn intra- and inter-modal representations. Its self-supervised targets are physics-informed, including a power-based reconstruction for SAR-L1, and dynamic expert pruning yields compact yet competitive variants (down to 1B parameters). The model is pre-trained on RingMOSS, a multi-modal RS dataset of 400M images, and achieves state-of-the-art results on 23 of 25 benchmarks spanning six RS tasks, with strong few-shot and cross-modal capabilities. The work also provides flexible deployment strategies (EP, KS/KA/KC) and comprehensive ablations, highlighting the value of modality-aware MoE design for scalable, efficient, and interpretable RS foundation models, with practical impact in disaster response, land management, and urban planning.

Abstract

The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
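The three expert groups named in the abstract can be illustrated with a toy routing layer. The following is a minimal sketch, not the authors' implementation: the expert counts, the top-k value, the softmax gating, the summation of the three branches, and all names (`ToyHierarchicalMoE`, `route_top_k`, the modality labels) are assumptions made for illustration. Each "expert" is a single random linear map standing in for an FFN.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_linear(n_out, n_in, rng):
    """Return a function applying a random n_out x n_in linear map."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return lambda x: [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def route_top_k(x, router, experts, k):
    """Weighted sum of the k experts with the highest gate scores."""
    gates = softmax(router(x))
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + (gates[i] / norm) * yi for o, yi in zip(out, y)]
    return out, top


class ToyHierarchicalMoE:
    """Hypothetical layer with modal-specialized, collaborative, and shared experts."""

    def __init__(self, dim, modalities, n_specific=2, n_collab=4, k=1, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Modal-specialized experts: one private pool (and router) per modality.
        self.specific = {
            m: [make_linear(dim, dim, rng) for _ in range(n_specific)]
            for m in modalities
        }
        self.specific_router = {
            m: make_linear(n_specific, dim, rng) for m in modalities
        }
        # Collaborative experts: one pool routed over by tokens of every modality.
        self.collab = [make_linear(dim, dim, rng) for _ in range(n_collab)]
        self.collab_router = make_linear(n_collab, dim, rng)
        # Shared expert: always active, no routing.
        self.shared = make_linear(dim, dim, rng)

    def forward(self, token, modality):
        spec, spec_ids = route_top_k(
            token, self.specific_router[modality], self.specific[modality], self.k
        )
        col, col_ids = route_top_k(token, self.collab_router, self.collab, self.k)
        sh = self.shared(token)
        # Assumption: the three branch outputs are simply summed.
        out = [a + b + c for a, b, c in zip(spec, col, sh)]
        return out, {"specific": spec_ids, "collaborative": col_ids}


layer = ToyHierarchicalMoE(
    dim=4,
    modalities=["optical", "multispectral", "sar_complex", "sar_amplitude"],
)
out, chosen = layer.forward([1.0, 0.5, -0.5, 2.0], "optical")
print(chosen)
```

In this sketch a token only ever sees the specialized pool of its own modality, while the collaborative pool and the shared expert are reachable from all four modalities, which is one plausible reading of how intra-modal and cross-modal knowledge are separated.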

Paper Structure

This paper contains 56 sections, 21 equations, 21 figures, 24 tables.

Figures (21)

  • Figure 1: The motivation for developing our multi-modal RSFM, i.e., RingMoE, is to adaptively process various image interpretation tasks from different RS modalities including optical, multi-spectral, and SAR (in complex-valued form and amplitude form).
  • Figure 2: Parameter scale of foundation models in the CV and RS fields [chen2020simple, radford2021learning, goyal2021self, jia2021scaling, zhai2022scaling, dehghani2023scaling, liu2022swin, riquelme2021scaling, oquab2023dinov2, li2023blip, he2020momentum, guo2024skysense, fuller2024croma, liu2024remoteclip, cha2024billion, reed2023scale, cong2022satmae, wang2022advancing, wang2022self, sun2022ringmo, jain2022multimodal, cha2021contrastive]. Our RingMoE is the largest in RS and ranks among the top in CV.
  • Figure 3: Structure comparison between previous RSFMs and the proposed RingMoE. (a) Unimodal RSFMs [cha2024billion, reed2023scale, cong2022satmae, wang2022advancing, sun2022ringmo, wang2024scaling]: Given a unimodal input, the latent representations are extracted by the unimodal encoder and then decoded via contrastive supervision or target reconstruction. (b) Typical multi-modal RSFMs [cha2021contrastive, jain2022multimodal, wang2022self, guo2024skysense, fuller2024croma]: Taking SAR-EO [cha2021contrastive] as an example, these models process two modalities separately through their respective encoders, followed by inter-modal interaction via a multi-modal fusion encoder, with decoding performed through contrastive supervision or target reconstruction. (c) Our RingMoE: Given four modal inputs, RingMoE employs a multi-modal encoder with a sparse RMoE structure that selectively activates different experts for each modality, capturing both inter- and intra-modal correlations. Additionally, modal-specific decoders are introduced for self-supervised learning, incorporating a power loss function to embed radar-specific imaging characteristics.
  • Figure 4: The proposed RingMoE achieves SOTA results on 23 of 25 benchmarks across six key RS tasks, outperforming existing foundation models.
  • Figure 5: Overview of the proposed RingMoE framework. Given the multi-modal inputs, random masking is applied to split the patches (i.e., tokens) into visible and masked sets. In the RingMoE encoder, the RMoE layer (a novel hierarchical Mixture-of-Experts structure) replaces all standard FFN layers in each stage, effectively capturing both inter-modal commonalities and intra-modal specializations in RS. Finally, the latent representations from different modalities are decoded by modal-specific decoders to reconstruct the original targets.
  • ...and 16 more figures
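
The random masking step described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `random_mask`, the 16-patch input, and the 0.75 mask ratio are hypothetical choices, since the captions above do not state the ratio or patch layout the authors use.

```python
import random


def random_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists, chosen uniformly at random."""
    rng = random.Random(seed)
    n_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:n_masked])   # targets the decoder must reconstruct
    visible = sorted(order[n_masked:])  # tokens the encoder actually sees
    return visible, masked


# Hypothetical setting: 16 patches, 75% masked.
visible, masked = random_mask(16, 0.75, seed=0)
print(len(visible), len(masked))
```

Only the visible indices would be passed to the encoder; the masked indices identify which patches the modal-specific decoders are asked to reconstruct.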