Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts

Xi Chen, Maojun Zhang, Yu Liu, Shen Yan

Abstract

Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel PEFT framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.
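
The routing idea in the abstract (each modality has its own gate, and each token is sent to its top-k experts) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The linear experts, the residual update, the separate expert pools per modality, the softmax over only the selected logits, and all names (`top_k_route`, `moe_refine`, `rand_matrix`) are assumptions made for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def top_k_route(token, gate_W, k):
    """Score every expert for one token and keep the k highest-scoring ones."""
    logits = matvec(gate_W, token)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: weights are renormalized over the selected experts only.
    weights = softmax([logits[i] for i in top])
    return top, weights


def moe_refine(token, gate_W, experts, k=2):
    """Add the weighted outputs of the selected experts to the token (residual form, assumed)."""
    idx, weights = top_k_route(token, gate_W, k)
    delta = [0.0] * len(token)
    for i, w in zip(idx, weights):
        out = matvec(experts[i], token)  # each expert is a plain linear map here
        delta = [d + w * o for d, o in zip(delta, out)]
    refined = [t + d for t, d in zip(token, delta)]
    return refined, idx, weights


def rand_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper: small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, n_experts, k = 4, 4, 2

# Dual gating: one gate per modality. Separate expert pools are an assumption.
visual_gate = rand_matrix(n_experts, dim, rng)
depth_gate = rand_matrix(n_experts, dim, rng)
visual_experts = [rand_matrix(dim, dim, rng) for _ in range(n_experts)]
depth_experts = [rand_matrix(dim, dim, rng) for _ in range(n_experts)]

visual_token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
depth_token = [rng.gauss(0.0, 1.0) for _ in range(dim)]

visual_out, visual_idx, visual_w = moe_refine(visual_token, visual_gate, visual_experts, k)
depth_out, depth_idx, depth_w = moe_refine(depth_token, depth_gate, depth_experts, k)
print("visual experts:", visual_idx, "depth experts:", depth_idx)
```

Because routing is done per token, two tokens from different land-cover regions can be sent to different experts, which is the spatially adaptive behaviour the abstract contrasts with a single global adjustment.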

Paper Structure (14 sections, 11 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Our SpectralMoE achieves (a) comprehensive SOTA performance across all spectral RS DGSS benchmarks. This superiority stems from our dual-gated MoE, which enables (b) fine-grained, spatially-adaptive adjustments. As shown in the qualitative results (visualized by Grad-CAM, top row), for complex target regions (e.g., the structures highlighted in the green box), our method generates a complete and fine-grained response, in stark contrast to the diffuse activations of global fine-tuning methods. This enhanced local refinement directly translates to qualitatively superior and more robust segmentation results in unseen domains (bottom row).
  • Figure 2: Spectral shift in spectral RS imagery. Variations in sensor characteristics and geospatial conditions can lead to significant divergence in the spectral signatures of land cover features belonging to the same class.
  • Figure 3: Overview of the proposed SpectralMoE framework. SpectralMoE is inserted as a lightweight plugin into each layer of frozen VFMs and DFMs. At its core is a dual-gated MoE mechanism. A dual-gated network independently routes visual and depth feature tokens to specialized experts, enabling fine-grained, spatially-adaptive adjustments that overcome the limitations of global, homogeneous methods. Following this expert-based refinement, a Cross-Attention Fusion Module adaptively injects the robust spatial structural information from the adjusted depth features into the visual features. This fusion process effectively mitigates semantic ambiguity caused by spectral shifts, significantly enhancing the model's cross-domain generalization capability.
  • Figure 4: Qualitative results for the Five-Billion-Pixels (cross-sensor) generalization task. From left to right, the columns show the input image, the ground truth, and the predictions of DSTC, DOFA, DINOv3, SET, FADA, REIN, DepthForge, and our proposed SpectralMoE. SpectralMoE generalizes better, producing more accurate and refined segmentation maps than the other SOTA domain generalization methods.
  • Figure 5: Ablation study on the number of experts ($N_e$).
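
The fusion step described in the Figure 3 caption, where structural cues from the refined depth features are injected into the visual features through cross-attention, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's module: it uses a single head, omits learned query/key/value projections, and adds the attended depth context to the visual tokens as a residual. The function name `cross_attention_fuse` is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(visual, depth):
    """Fuse depth tokens into visual tokens.

    visual: list of N token vectors, used as queries.
    depth:  list of M token vectors, used as both keys and values
            (assumption: no learned projections in this sketch).
    Returns N fused vectors: visual token + attended depth context.
    """
    d = len(visual[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product attention
    fused = []
    for q in visual:
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in depth]
        attn = softmax(scores)
        context = [sum(a * v[j] for a, v in zip(attn, depth)) for j in range(d)]
        fused.append([qi + ci for qi, ci in zip(q, context)])  # residual injection
    return fused


visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
# Three identical depth tokens: attention is uniform, so the context equals that token.
depth_tokens = [[0.5, -0.5]] * 3
fused_tokens = cross_attention_fuse(visual_tokens, depth_tokens)
print(fused_tokens)
```

In this toy input every depth token is the same, so each visual token is simply shifted by that depth vector. With varied depth tokens, each visual token would instead draw more from the depth tokens it is most similar to.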