Customize Segment Anything Model for Multi-Modal Semantic Segmentation with Mixture of LoRA Experts

Chenyang Zhu, Bin Xiao, Lin Shi, Shoukun Xu, Xu Zheng

TL;DR

This paper addresses the challenge of applying the Segment Anything Model (SAM) to multi-modal semantic segmentation. It introduces a Mixture of LoRA Experts (MLE) that trains modality-specific LoRA adapters and employs MoE routing to fuse features across RGB, depth, event, and LiDAR modalities, while keeping the SAM encoder frozen. A dual-path decoder with an auxiliary head enables effective multi-scale feature fusion and high-resolution mask prediction, achieving state-of-the-art results on DELIVER, MUSES, and MCubeS, including strong robustness to missing or noisy modalities. The approach demonstrates the practical value of modular, parameter-efficient cross-modal adaptation for scalable segmentation in real-world, sensor-diverse environments.

Abstract

The recent Segment Anything Model (SAM) represents a significant breakthrough in scaling segmentation models, delivering strong performance across various downstream applications in the RGB modality. However, directly applying SAM to emerging visual modalities, such as depth and event data, results in suboptimal performance in multi-modal segmentation tasks. In this paper, we make the first attempt to adapt SAM for multi-modal semantic segmentation by proposing a Mixture of Low-Rank Adaptation Experts (MoE-LoRA) tailored for different input visual modalities. By training only the MoE-LoRA layers while keeping SAM's weights frozen, we preserve SAM's strong generalization and segmentation capabilities for downstream tasks. Specifically, to address cross-modal inconsistencies, we propose a novel MoE routing strategy that adaptively generates weighted features across modalities, enhancing multi-modal feature integration. Additionally, we incorporate multi-scale feature extraction and fusion by adapting SAM's segmentation head and introducing an auxiliary segmentation head that combines multi-scale features to improve segmentation performance. Extensive experiments were conducted on three multi-modal benchmarks: DELIVER, MUSES, and MCubeS. The results consistently demonstrate that the proposed method significantly outperforms state-of-the-art approaches across diverse scenarios. Notably, under the particularly challenging condition of missing modalities, our approach exhibits a substantial performance gain, achieving an improvement of 32.15% compared to existing methods.
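
The two ideas in the abstract, a frozen weight with a trainable low-rank update per modality and a router that weights modality features, can be illustrated with a small sketch. This is not the authors' implementation: the function names `lora_forward` and `top_k_fuse`, the `alpha / r` scaling, and the softmax-over-top-k gating rule are assumptions chosen for illustration, since the abstract does not specify these details. Plain Python lists stand in for tensors.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen pretrained weight; only A (r x d_in) and B (d_out x r)
    would be trained. The alpha / r scaling is the common LoRA convention
    and is an assumption here.
    """
    r = len(A)  # rank = number of rows of A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def top_k_fuse(features, scores, k):
    """Hypothetical routing rule: keep the k highest-scoring modalities,
    softmax their scores, and return the weighted sum of their features."""
    keep = sorted(scores, key=scores.get, reverse=True)[:k]
    m = max(scores[n] for n in keep)  # subtract max for numerical stability
    exp = {n: math.exp(scores[n] - m) for n in keep}
    total = sum(exp.values())
    weights = {n: exp[n] / total for n in keep}
    dim = len(next(iter(features.values())))
    fused = [sum(weights[n] * features[n][i] for n in keep) for i in range(dim)]
    return fused, weights

# Frozen 2x2 identity weight shared by all modalities (illustrative values).
W = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]

# Rank-1 adapter for one modality; with B = 0 the layer reproduces W x.
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]
B_trained = [[0.5], [0.0]]
y_init = lora_forward(W, A, B_zero, x)
y_adapted = lora_forward(W, A, B_trained, x)

# Route across three modality features; the low-scoring one is dropped.
features = {"rgb": [1.0, 0.0], "depth": [0.0, 1.0], "event": [5.0, 5.0]}
scores = {"rgb": 0.0, "depth": 0.0, "event": -10.0}
fused, weights = top_k_fuse(features, scores, k=2)
```

With `B` initialised to zero the adapted layer returns exactly the frozen layer's output, which is why training only the adapters leaves the pretrained behaviour as the starting point. In the routing example, the modality with the lowest score receives no weight at all, a simple way to picture how a degraded or missing sensor could be excluded from the fused feature.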

Paper Structure

This paper contains 18 sections, 14 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: (a) Overview of MLE-SAM, (b) Performance on DELIVER (R-D-E-L Modalities), (c) Performance on MUSES (F-E-L Modalities), (d) Evaluation Across Modality Combinations and Scenarios on DELIVER, and (e) on MUSES Datasets.
  • Figure 2: Illustration of the proposed MLE-SAM framework for multi-modal semantic segmentation. The architecture combines multi-scale features from a frozen image encoder fine-tuned with LoRA layers. Semantic feature maps and feature pyramids across modalities are averaged and refined via a top-k mechanism. Fused features are processed with a dual-pathway strategy.
  • Figure 3: Hierarchical Refinement Pathway for High-Resolution Embedding
  • Figure 4: Multi-Scale Feature Fusion Pathway for High-Resolution Embedding
  • Figure 5: Visualization of extracted feature maps of DELIVER under sensor failure cases for RGB, Depth, Event, LiDAR, and R-D-E-L modalities
  • ...and 2 more figures