
Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation

Haocheng Li, Juepeng Zheng, Shuangxi Miao, Ruibo Lu, Guosheng Cai, Haohuan Fu, Jianxi Huang

Abstract

Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) that enables deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters attached to the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy that mitigates modality imbalance by randomly masking one modality during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code for this work is available at https://github.com/sauryeo/MoBaNet.

Paper Structure

This paper contains 30 sections, 18 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Overall framework of the proposed multimodal segmentation model with MCRM training. A pretrained vision foundation model initializes a shared frozen ViT backbone. RGB and DSM inputs are first tokenized by modality-specific embeddings, then processed by a stage-wise dual-stream encoder, where CPIA is inserted before each selected stage and DGFM is applied after each selected stage to produce multi-level fused features for the decoder. During training, MCRM performs modality-conditional random masking and hard-pixel auxiliary supervision on one main branch (S1) and two auxiliary modality branches (S2, S3), while only the main branch is used for inference.
  • Figure 2: Structure of the CPIA module. A shared semantic base is generated from paired RGB and auxiliary tokens, transformed into modality-specific prompts by TFT, and injected into lightweight bottleneck adapters for symmetric cross-modal semantic modulation before each selected stage.
  • Figure 3: Structure of the DGFM module. Reduced RGB and DSM features, together with their discrepancy cue, are fed into a lightweight gate network to predict an adaptive gating map, which dynamically fuses the two modalities into a compact multimodal representation.
  • Figure 4: Visual comparison on 512 × 512 patches from the Vaihingen test set. (a) U-Net, (b) ABCNet, (c) DC-Swin, (d) UNetFormer, (e) FTransUNet, (f) MANet, and (g) the proposed MoBaNet. Purple boxes in each subfigure highlight regions where the methods differ.
  • Figure 5: Visual comparison on 512 × 512 patches from the Potsdam test set. (a) U-Net, (b) ABCNet, (c) DC-Swin, (d) UNetFormer, (e) FTransUNet, (f) MANet, and (g) the proposed MoBaNet. Purple boxes in each subfigure highlight regions where the methods differ.
  • ...and 1 more figures
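
The Figure 3 caption describes DGFM as a gate network that takes the two reduced features plus a discrepancy cue and predicts a gating map used to blend the modalities. The following is a minimal sketch of that idea for a single spatial position, not the authors' implementation. Several details are assumptions made here for illustration: the discrepancy cue is taken to be the element-wise absolute difference, the gate network is reduced to one linear layer followed by a sigmoid, the gate is a single scalar per position, and the fusion is a convex combination. The names `gated_fusion`, `weights`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb, dsm, weights, bias=0.0):
    """Blend two feature vectors with a gate driven by their discrepancy.

    rgb, dsm : lists of C floats (reduced features at one spatial position)
    weights  : list of 3*C floats for the hypothetical linear gate layer
    bias     : scalar bias of the gate layer

    Returns (fused, gate), where fused has C entries and gate lies in (0, 1).
    """
    if len(rgb) != len(dsm):
        raise ValueError("rgb and dsm must have the same number of channels")

    # Assumed discrepancy cue: element-wise absolute difference.
    diff = [abs(r - d) for r, d in zip(rgb, dsm)]

    # The gate sees both features and the discrepancy cue.
    gate_input = rgb + dsm + diff
    if len(weights) != len(gate_input):
        raise ValueError("weights must have 3 * C entries")

    # Assumed gate network: a single linear layer followed by a sigmoid.
    logit = sum(w * x for w, x in zip(weights, gate_input)) + bias
    gate = sigmoid(logit)

    # Assumed fusion rule: convex combination of the two modalities.
    fused = [gate * r + (1.0 - gate) * d for r, d in zip(rgb, dsm)]
    return fused, gate


if __name__ == "__main__":
    rgb_feat = [1.0, 3.0]
    dsm_feat = [3.0, 1.0]
    # With all-zero weights and zero bias the gate is 0.5, so the
    # result is the plain average of the two feature vectors.
    fused, gate = gated_fusion(rgb_feat, dsm_feat, [0.0] * 6)
    print(gate, fused)
```

Applying the same function independently at every spatial position would yield a gating map in the sense of the caption. In a trained model the gate weights would be learned, so positions where the modalities disagree could be routed toward whichever input is more reliable there.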