Segment Anything with Multiple Modalities
Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu
TL;DR
The paper tackles robust segmentation in multi-modal sensor environments by extending the Segment Anything Model (SAM) to cross-modal and multi-modal data. It introduces Unsupervised Cross-Modal Transfer (UCMT), which aligns non-RGB modality embeddings with SAM's RGB embedding space via a modality-specific patch embedding, LoRA-based tuning, and an embedding-unification loss $L_U$. It also introduces Weakly-supervised Multi-Modal Fusion (WMMF), which combines a Selective Fusion Gate with multi-modal pseudo-labeling. Together they form the joint objective $L = L_U + L_F$, where $L_F = L_{bce}(\hat{M}_F, M_F) + L_{dice}(\hat{M}_F, M_F)$. Experiments across seven datasets and eight modalities show large, consistent gains over SAM, demonstrating label and parameter efficiency and broad applicability to sensor-based perception.
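As a rough illustration of the fusion term $L_F$, the sketch below computes a binary cross-entropy loss plus a soft Dice loss between a predicted mask $\hat{M}_F$ and a target mask $M_F$, then adds a given value of $L_U$ to form the joint objective. It is a minimal sketch under stated assumptions, not the authors' implementation: masks are assumed to be flattened lists of per-pixel probabilities in [0, 1], the clamping constant `eps` and the Dice smoothing term `smooth` are illustrative choices, and the function names are hypothetical. The form of $L_U$ is not specified here, so it is passed in as a precomputed scalar.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over per-pixel probabilities.

    `eps` is an assumed clamping constant that keeps log() finite.
    """
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum(pred) + sum(target) + smooth).

    `smooth` is an assumed stabilizer for empty masks.
    """
    overlap = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)


def fusion_loss(pred, target):
    """L_F = L_bce + L_dice for a predicted and a target mask."""
    return bce_loss(pred, target) + dice_loss(pred, target)


def joint_loss(l_u, pred, target):
    """L = L_U + L_F, with L_U supplied as a precomputed scalar."""
    return l_u + fusion_loss(pred, target)


if __name__ == "__main__":
    target_mask = [1.0, 0.0, 1.0, 0.0]   # hypothetical 4-pixel target mask
    good_pred = [0.9, 0.1, 0.8, 0.2]     # close to the target
    bad_pred = [0.1, 0.9, 0.2, 0.8]      # nearly inverted
    print("L_F (good):", fusion_loss(good_pred, target_mask))
    print("L_F (bad): ", fusion_loss(bad_pred, target_mask))
```

A prediction close to the target yields a lower $L_F$ than a nearly inverted one. Pairing the two terms is a common choice in mask segmentation: the cross-entropy term penalizes per-pixel errors, while the Dice term measures region overlap and is less sensitive to foreground/background imbalance.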
Abstract
Robust and accurate segmentation of scenes has become a core functionality in various visual recognition and navigation tasks. This has inspired the recent development of the Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, and thermal plus RGB. We develop MM-SAM, an extension of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, enabling label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.
