Segment Anything with Multiple Modalities

Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu

TL;DR

The paper tackles robust segmentation in multi-modal sensor environments by extending the Segment Anything Model (SAM) to cross-modal and multi-modal data. It introduces Unsupervised Cross-Modal Transfer (UCMT), which aligns non-RGB modality embeddings with SAM's RGB embedding space via a modality-specific patch embedding, LoRA-based tuning, and an embedding-unification loss $L_U$. It also introduces Weakly-supervised Multi-Modal Fusion (WMMF), which combines a Selective Fusion Gate with multi-modal pseudo-labeling. Together they form the joint objective $L = L_U + L_F$, with $L_F = L_{bce}(\hat{M}_F, M_F) + L_{dice}(\hat{M}_F, M_F)$. Experiments across seven datasets and eight modalities show large, consistent gains over SAM, demonstrating label- and parameter-efficiency and broad applicability to sensor-based perception.
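
The joint objective above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the form of $L_U$, so a mean squared error between embeddings stands in for it here, and the first argument of each mask loss is assumed to hold predicted probabilities while the second holds a binary pseudo-label mask. All function names are hypothetical.

```python
import numpy as np

def bce_loss(m_hat, m, eps=1e-7):
    """Mean binary cross-entropy; m_hat is assumed to hold probabilities in [0, 1]."""
    p = np.clip(m_hat, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(m * np.log(p) + (1.0 - m) * np.log(1.0 - p)))

def dice_loss(m_hat, m, eps=1e-7):
    """Soft Dice loss: 1 - 2 * overlap / (sum of both masks)."""
    overlap = np.sum(m_hat * m)
    return float(1.0 - (2.0 * overlap + eps) / (np.sum(m_hat) + np.sum(m) + eps))

def fusion_loss(m_hat, m):
    """L_F = L_bce + L_dice, as stated in the summary."""
    return bce_loss(m_hat, m) + dice_loss(m_hat, m)

def unification_loss(x_emb, rgb_emb):
    """ASSUMPTION: mean squared error as a stand-in for L_U (form not given here)."""
    return float(np.mean((x_emb - rgb_emb) ** 2))

def joint_loss(x_emb, rgb_emb, m_hat, m):
    """L = L_U + L_F."""
    return unification_loss(x_emb, rgb_emb) + fusion_loss(m_hat, m)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb_emb = rng.normal(size=(16, 8))                     # 16 patches, 8 channels (toy sizes)
    x_emb = rgb_emb + 0.1 * rng.normal(size=(16, 8))       # non-RGB embeddings close to RGB
    mask = (rng.random((4, 4)) > 0.5).astype(float)        # toy binary pseudo-label
    pred = np.clip(mask * 0.9 + 0.05, 0.0, 1.0)            # confident, mostly correct prediction
    print("L =", joint_loss(x_emb, rgb_emb, pred, mask))
```

Both terms are simply summed with equal weight, matching $L = L_U + L_F$ as written; any weighting between them would be an additional assumption.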

Abstract

Robust and accurate segmentation of scenes has become one core functionality in various visual recognition and navigation tasks. This has inspired the recent development of Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely-adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely, unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, enabling label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.

Paper Structure (25 sections, 4 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: The proposed MM-SAM extends and expands SAM towards multi-modal data with various sensor suites, facilitating cross-modal and multi-modal segmentation without requiring mask annotations in different downstream tasks.
  • Figure 2: MM-SAM extends and expands SAM effectively. (a) Activation heatmap and mask predictions for an example from SUN RGB-D [song2015sunrgbd] with a box prompt in the 1st column. MM-SAM performs clearly better for cross-modal segmentation of depth, and it also enables superb multi-modal segmentation with modality fusion. (b) MM-SAM demonstrates superior robustness and accuracy across seven multi-modal datasets, each featuring RGB plus a non-RGB modality (SAM on RGB [$\bullet$], SAM on non-RGB X* [$\bullet$], MM-SAM on non-RGB X with cross-modal adaptation [$\bullet$] and MM-SAM on RGB + non-RGB with multi-modal fusion [$\bullet$]). The symbol * denotes false-color images transformed from each non-RGB modality. The radius is normalized by MM-SAM's multi-modal segmentation scores. Bigger area coverage indicates better segmentation. Best viewed in color.
  • Figure 3: Overview of MM-SAM. MM-SAM freezes the entire SAM architecture while tuning it with multi-modal pairs (RGB and a non-RGB modality X) to achieve cross-modal and multi-modal segmentation. It consists of two novel tuning modules: 1) Unsupervised Cross-Modal Transfer (UCMT) introduces a modality-specific patch embedding module and low-rank (LoRA) structures into SAM’s image encoder for extracting modality-specific X embeddings. An embedding unification loss ($L_U$) aligns X embeddings with SAM’s RGB image embeddings to ensure segmentation compatibility; 2) Weakly-supervised Multi-Modal Fusion (WMMF) incorporates a Selective Fusion Gate (SFG) to generate multi-modal embeddings, trained with multi-modal pseudo-labeling for adaptive sensor fusion. The whole training is mask-free. During inference, MM-SAM supports segmentation for single- or multiple-modality data.
  • Figure 4: Visual illustration of adaptive fusion for enhanced segmentation with MM-SAM, using one sample of paired RGB and thermal images from the MFNet dataset. The second column shows fusion weights from the SFG, where brighter areas represent higher weights.
  • Figure 5: Segmentation performance of MM-SAM on the MFNet ("Total" split) using different parameter-efficient tuning (PEFT) methods in (a) and various ViT backbones in (b).
  • ...and 1 more figure
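
The adaptive fusion described in the Figure 3 and Figure 4 captions can be illustrated with a small gating sketch. The captions say only that the Selective Fusion Gate produces per-location fusion weights and multi-modal embeddings; the internals below are assumptions made for illustration: a single linear layer over the concatenated RGB and X embeddings, a sigmoid to get a weight in (0, 1), and a convex combination of the two embeddings. The names `selective_fusion_gate`, `w`, and `b` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def selective_fusion_gate(rgb_emb, x_emb, w, b):
    """Fuse two (N, C) patch-embedding arrays with a per-patch gate.

    ASSUMPTION: the gate is one linear layer (weights w of shape (2C,), bias b)
    on the concatenated embeddings, followed by a sigmoid.
    Returns the fused (N, C) embeddings and the (N, 1) gate weights.
    """
    feats = np.concatenate([rgb_emb, x_emb], axis=-1)   # (N, 2C)
    gate = sigmoid(feats @ w + b)[:, None]              # (N, 1), weight given to RGB
    fused = gate * rgb_emb + (1.0 - gate) * x_emb       # convex combination per patch
    return fused, gate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_patches, channels = 6, 4                          # toy sizes
    rgb_emb = rng.normal(size=(n_patches, channels))
    x_emb = rng.normal(size=(n_patches, channels))
    w = rng.normal(size=(2 * channels,))
    fused, gate = selective_fusion_gate(rgb_emb, x_emb, w, b=0.0)
    print(fused.shape, gate.ravel())
```

Because the gate is computed per patch, the fused embedding can lean on RGB in some regions and on the non-RGB modality in others, which is the behavior the Figure 4 weight map visualizes.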

Theorems & Definitions (2)

  • Remark: Efficiency
  • Remark: Insights