MMSFormer: Multimodal Transformer for Material and Semantic Segmentation

Md Kaykobad Reza; Ashley Prater-Bennette; M. Salman Asif

TL;DR

This work tackles multimodal material and semantic segmentation by designing MMSFormer, a transformer-based framework that fuses features from arbitrary modality sets via a novel Multimodal Fusion Block. The fusion block combines per-modality features through linear fusion, parallel multi-scale convolutions, and channel attention, enabling efficient and effective integration across modalities. Across MCubeS, FMB, and PST900 datasets, MMSFormer achieves state-of-the-art results and shows that adding modalities yields consistent, incremental improvements, with ablations confirming the importance of each fusion-block component. The approach offers a scalable, modality-agnostic path toward robust multimodal segmentation, though future work could explore shared encoders to further reduce parameter counts and extend to additional modalities and tasks.

Abstract

Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different modality combinations. We also propose a new model named Multi-Modal Segmentation TransFormer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material and semantic segmentation tasks. MMSFormer outperforms current state-of-the-art models on three different datasets. Starting from a single input modality, performance improves progressively as additional modalities are incorporated, showing that the fusion block effectively combines useful information from diverse input modalities. Ablation studies show that the different modules in the fusion block are crucial for overall model performance. They also highlight the capacity of different input modalities to improve the identification of different types of materials. The code and pretrained models will be made available at https://github.com/csiplab/MMSFormer.

Paper Structure (17 sections, 10 equations, 3 figures, 9 tables)

Introduction
Related Work
Proposed Model
Overall Model Architecture
Modality Specific Encoder
Multimodal Fusion Block
Shared MLP Decoder
Experiments and Results
Datasets
Implementation Details
Performance Comparison with Existing Methods
Performance Comparison for Incremental Modality Integration
Qualitative analysis of the Predictions
Ablation Study on the Fusion Block
Ablation Study on Different Modality Combinations
...and 2 more sections

Figures (3)

Figure 1: (a) Overall architecture of the MMSFormer model. Each image passes through a modality-specific encoder that extracts hierarchical features. The extracted features are then fused by the proposed fusion block, and the fused features are passed to the decoder to predict the segmentation map. (b) Illustration of the mix transformer block (Xie et al., 2021). Each block applies a spatial reduction before multi-head attention to reduce computational cost. (c) Proposed multimodal fusion block. The features are first concatenated along the channel dimension and passed through a linear fusion layer. The fused tensor is then fed to a linear projection and parallel convolution layers to capture multi-scale features. A Squeeze-and-Excitation block (Hu et al., 2019) serves as channel attention in the residual connection, dynamically recalibrating the features along the channel dimension.
Figure 2: Visualization of predictions on the MCubeS and PST900 datasets. The MCubeS panel shows RGB and all-modality (RGB-A-D-N) predictions from CMNeXt (Zhang et al., 2023) and our model. For brevity, only the RGB image and the ground-truth material segmentation map are shown alongside the predictions. The PST900 panel shows predictions from RTFNet (Sun et al., 2019), FDCNet (Zhao et al., 2023), and our model for RGB-thermal input. Our model produces better predictions on both datasets.
Figure 3: Visualization of predicted segmentation maps for different modality combinations on the MCubeS (Liang et al., 2022) and FMB (Liu et al., 2023) datasets. In both panels, prediction accuracy increases as new modalities are added incrementally, illustrating the fusion block's ability to combine information from different modality combinations.
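
The fusion block described in the Figure 1(c) caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel sizes (3, 5, 7), the use of depthwise convolutions, the absence of activations between layers, the SE reduction ratio, the random weights, and all names (`fusion_block`, `init_params`, the parameter keys) are chosen for the example only.

```python
import numpy as np


def linear(x, w, b):
    """Per-pixel linear layer (equivalent to a 1x1 convolution).

    x: (C_in, H, W), w: (C_out, C_in), b: (C_out,) -> (C_out, H, W)
    """
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]


def depthwise_conv(x, kernels):
    """Depthwise k x k convolution with zero padding; kernels: (C, k, k)."""
    _, h, w = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j, None, None] * xp[:, i:i + h, j:j + w]
    return out


def squeeze_excite(x, w1, w2):
    """Channel attention: global average pool, bottleneck MLP, sigmoid gates."""
    s = x.mean(axis=(1, 2))                  # squeeze to one value per channel
    z = np.maximum(w1 @ s, 0.0)              # ReLU bottleneck
    g = 1.0 / (1.0 + np.exp(-(w2 @ z)))      # per-channel gates in (0, 1)
    return x * g[:, None, None]


def init_params(num_modalities, channels, reduction=4, seed=0):
    """Random illustrative weights (hypothetical; not pretrained values)."""
    rng = np.random.default_rng(seed)
    c, r = channels, channels // reduction
    return {
        "fuse_w": rng.normal(0, 0.1, (c, num_modalities * c)),
        "fuse_b": np.zeros(c),
        "proj_w": rng.normal(0, 0.1, (c, c)),
        "proj_b": np.zeros(c),
        "kernels": [rng.normal(0, 0.1, (c, k, k)) for k in (3, 5, 7)],
        "se_w1": rng.normal(0, 0.1, (r, c)),
        "se_w2": rng.normal(0, 0.1, (c, r)),
    }


def fusion_block(features, p):
    """Fuse a list of per-modality feature maps, each of shape (C, H, W)."""
    x = np.concatenate(features, axis=0)              # concat along channels
    fused = linear(x, p["fuse_w"], p["fuse_b"])       # linear fusion: M*C -> C
    proj = linear(fused, p["proj_w"], p["proj_b"])    # linear projection
    multi = sum(depthwise_conv(proj, k) for k in p["kernels"])  # parallel convs
    residual = squeeze_excite(fused, p["se_w1"], p["se_w2"])    # channel attention
    return residual + multi


# Usage: two modalities (for example RGB and thermal) at one encoder stage.
rng = np.random.default_rng(1)
feats = [rng.normal(size=(8, 6, 6)) for _ in range(2)]
params = init_params(num_modalities=2, channels=8)
out = fusion_block(feats, params)
print(out.shape)
```

Because the features are concatenated before the linear fusion layer, only the input width of `fuse_w` depends on the number of modalities; the rest of the block is unchanged when a modality is added, which is one way to read the caption's claim that the block handles different modality combinations.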
