Rethinking Early-Fusion Strategies for Improved Multimodal Image Segmentation

Zhengwen Shen, Yulian Li, Han Zhang, Yuchen Weng, Jun Wang

TL;DR

This work revisits RGB-T semantic segmentation and introduces EFNet, an early-fusion framework that reduces computational cost while maintaining accuracy. It combines a Multimodal Feature Interaction and Fusion module (MIF), Dual-distance Balanced Token Clustering (DBTC) for efficient downsampling, and a lightweight Multi-scale Feature Aggregation Decoder (MFAD) based on Euclidean distance. On the MFNet, PST900, and FMB datasets, EFNet achieves state-of-the-art performance with substantially fewer parameters and FLOPs than prior methods. The results indicate that early fusion, combined with token clustering and distance-based decoding, yields robust multimodal segmentation under challenging illumination conditions.

Abstract

RGB and thermal image fusion has great potential to improve semantic segmentation in low-illumination conditions. Existing methods typically employ a two-branch encoder framework for multimodal feature extraction and design complicated strategies to fuse the extracted features. However, these methods require massive parameter updates and computational effort during feature extraction and fusion. To address this issue, we propose a novel multimodal fusion network (EFNet) based on an early fusion strategy and a simple but effective feature clustering to train efficient RGB-T semantic segmentation models. In addition, we propose a lightweight and efficient multi-scale feature aggregation decoder based on Euclidean distance. We validate the effectiveness of our method on different datasets, where it outperforms previous state-of-the-art methods with fewer parameters and less computation.
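
As a rough illustration of two ideas named in the abstract, the toy sketch below fuses RGB and thermal values per pixel before any further processing (early fusion) and then labels each fused pixel by its nearest class prototype under Euclidean distance. This is not EFNet's MIF, DBTC, or MFAD, whose details are not given in this summary; the function names `early_fuse` and `nearest_prototype`, the prototype values, and the pixel data are all hypothetical and chosen only for the example.

```python
import math


def early_fuse(rgb, thermal):
    """Concatenate each RGB pixel (3 values) with its thermal value into one 4-channel pixel.

    Hypothetical helper: stands in for feeding both modalities into a single encoder
    instead of running two separate branches.
    """
    if len(rgb) != len(thermal):
        raise ValueError("RGB and thermal inputs must have the same number of pixels")
    return [tuple(pixel) + (heat,) for pixel, heat in zip(rgb, thermal)]


def nearest_prototype(features, prototypes):
    """Label each feature with the index of the closest prototype (Euclidean distance).

    Hypothetical helper: a generic distance-based classifier, not the paper's decoder.
    """
    labels = []
    for feature in features:
        distances = [math.dist(feature, proto) for proto in prototypes]
        labels.append(distances.index(min(distances)))
    return labels


# Toy data (assumed): three pixels with normalized RGB and thermal intensities.
rgb = [(0.9, 0.1, 0.1), (0.1, 0.1, 0.1), (0.2, 0.8, 0.2)]
thermal = [0.2, 0.9, 0.1]

# One made-up prototype per class in the fused 4-channel space.
prototypes = [
    (1.0, 0.0, 0.0, 0.0),  # class 0: red, cold
    (0.0, 0.0, 0.0, 1.0),  # class 1: dark, hot
    (0.0, 1.0, 0.0, 0.0),  # class 2: green, cold
]

fused = early_fuse(rgb, thermal)
labels = nearest_prototype(fused, prototypes)
print(labels)
```

The second pixel is nearly black in RGB and would be hard to classify from color alone; after fusion its high thermal value places it closest to the "hot" prototype, which is the intuition behind combining the two modalities in low light.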

Paper Structure

This paper contains 12 sections, 8 equations, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Feature selection based on thresholds or clustering.
  • Figure 2: Overall architecture of the proposed method.