U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation

Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li

TL;DR

U3M (Unbiased Multiscale Modal Fusion Model) is a multimodal semantic segmentation model that integrates multimodal visual data without favoring any single modality and fuses features at multiple scales so that both global and local features are extracted and integrated effectively.

Abstract

Multimodal semantic segmentation is a pivotal component of computer vision and typically surpasses unimodal methods by utilizing rich information from various sources. Current models frequently adopt modality-specific frameworks that are inherently biased toward certain modalities. Although these biases might be advantageous in specific situations, they generally limit the adaptability of the models across different multimodal contexts, thereby potentially impairing performance. To address this issue, we leverage the inherent capabilities of the model itself to discover the optimal equilibrium in multimodal fusion and introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation. Specifically, this method involves an unbiased integration of multimodal visual data. Additionally, we employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets, verifying its efficacy in enhancing the robustness and versatility of semantic segmentation in diverse settings. Our code is available at U3M-multimodal-semantic-segmentation.

Paper Structure

This paper contains 17 sections, 21 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: The Evolution of Multimodal Semantic Segmentation Model Architectures. (a) Training a feature extractor using only RGB images. (b) Sharing a trainable feature extractor between RGB images and other modalities. (c) Top: Sharing a fine-tunable pre-trained feature extractor between RGB images and other modalities. Bottom: Fine-tuning adapters for different modalities with one frozen feature extractor. (d) Top: Single-scale feature fusion within the model. Bottom: Multiscale feature fusion.
  • Figure 2: The dynamic dominant correlation of multimodal data in different scenes. Left: Under conditions of insufficient light, infrared images can capture more intricate details than RGB images. Middle: In outdoor situations where light is abundant, infrared images tend to lose more details, while RGB images showcase their superiority. Right: In certain common instances, the detailed information in infrared images and RGB images can serve as a complement to each other.
  • Figure 3: Unbiased Multiscale Modal Fusion Model. SegFormer (xie2021segformer) with frozen parameters serves as the feature extractor. Each modality's input is fed into its own feature extractor, and the resulting features are divided into four distinct scales for unbiased fusion of multiscale information. Each feature fusion layer comprises two modules based on multiscale pooling and convolution, adaptively extracting features at varied scales. Finally, the multiscale information is concatenated and fed into a shared semantic segmentation head to generate the segmentation results.
  • Figure 4: Multiscale Feature Fusion Module. To enhance the extraction of information across multiple scales, a multiscale feature extractor is proposed.
  • Figure 5: Multiscale pooling and convolution. Pooling and convolution at different scales are capable of capturing local and global features across multiple levels, thereby complementing the global attention mechanisms integrated within the backbone architecture effectively. This approach ensures that the resultant fused features encompass a comprehensive focus on both local and global dimensions.
  • ...and 3 more figures
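
The captions of Figures 3 and 5 describe two ideas that a small example may make more concrete: fusing modalities without privileging any of them, and pooling at several window sizes to mix local and wider context. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The function names, the kernel sizes (3, 5, 7), and the use of a plain element-wise mean in place of the paper's learned fusion layers are all hypothetical choices made for this example, and the convolution branch is omitted.

```python
from typing import List, Sequence

Grid = List[List[float]]  # one single-channel feature map


def avg_pool_same(x: Grid, k: int) -> Grid:
    """Average-pool with an odd k x k window, stride 1, same output size.

    Border cells average only over the in-bounds part of the window.
    """
    h, w = len(x), len(x[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            vals = [
                x[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def multiscale_pool(x: Grid, kernel_sizes: Sequence[int] = (3, 5, 7)) -> Grid:
    """Average the input with its pooled versions at several window sizes.

    Kernel sizes are an assumption made for illustration.
    """
    pooled = [avg_pool_same(x, k) for k in kernel_sizes]
    h, w = len(x), len(x[0])
    n = 1 + len(pooled)
    return [
        [(x[i][j] + sum(p[i][j] for p in pooled)) / n for j in range(w)]
        for i in range(h)
    ]


def unbiased_fuse(modalities: Sequence[Grid]) -> Grid:
    """Symmetric fusion: element-wise mean over modalities.

    A stand-in for a learned fusion layer. Every modality is treated
    identically, so the result does not depend on their order.
    """
    n = len(modalities)
    h, w = len(modalities[0]), len(modalities[0][0])
    return [
        [sum(m[i][j] for m in modalities) / n for j in range(w)]
        for i in range(h)
    ]


# Toy usage: two 4 x 4 "feature maps" standing in for RGB and infrared.
rgb = [[1.0] * 4 for _ in range(4)]
infrared = [[3.0] * 4 for _ in range(4)]
fused = multiscale_pool(unbiased_fuse([rgb, infrared]))
```

The point of the symmetric mean is that swapping the order of `rgb` and `infrared` leaves the output unchanged, which is one simple way to read "unbiased" here. A learned fusion module would replace the mean while keeping the same per-modality symmetry.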