Spectral-Aware Global Fusion for RGB-Thermal Semantic Segmentation

Ce Zhang, Zifu Wan, Simon Stepputtis, Katia Sycara, Yaqi Xie

TL;DR

This work tackles RGB-T semantic segmentation under challenging conditions by introducing SGFNet, a spectral-aware fusion network. SGFNet explicitly prioritizes high-frequency, modality-specific details through spectral-aware feature enhancement and channel attention, and it merges RGB and thermal features via a global cross-modal spatial attention mechanism. The approach achieves state-of-the-art performance on MFNet and PST900, and ablations show the contribution of each spectral component and attention module. The results indicate robust performance across conditions, with practical implications for reliable perception in autonomous systems.

Abstract

Semantic segmentation relying solely on RGB data often struggles in challenging conditions such as low illumination and obscured views, limiting its reliability in critical applications like autonomous driving. To address this, integrating additional thermal radiation data with RGB images demonstrates enhanced performance and robustness. However, how to effectively reconcile the modality discrepancies and fuse the RGB and thermal features remains a well-known challenge. In this work, we address this challenge from a novel spectral perspective. We observe that the multi-modal features can be categorized into two spectral components: low-frequency features that provide broad scene context, including color variations and smooth areas, and high-frequency features that capture modality-specific details such as edges and textures. Inspired by this, we propose the Spectral-aware Global Fusion Network (SGFNet) to effectively enhance and fuse the multi-modal features by explicitly modeling the interactions between the high-frequency, modality-specific features. Our experimental results demonstrate that SGFNet outperforms the state-of-the-art methods on the MFNet and PST900 datasets.

Paper Structure

This paper contains 12 sections, 14 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Interpreting an image pair from a spectral perspective. The high-frequency components are extracted by high-pass filtering (HPF) of the image's Fourier spectrum. The attention maps are visualized using Grad-CAM (Selvaraju et al., 2017). We demonstrate that the HPF images capture texture and edge details that are specific to each modality. In our proposed SGFNet, we explicitly consider the interaction among these high-frequency components to effectively fuse multi-modal features and enhance segmentation performance.
  • Figure 2: The overall framework of our SGFNet. The SGF module is designed to effectively fuse multi-scale features derived from both RGB and thermal encoders. We also employ a deep supervision mechanism to supervise the preliminary prediction map at each scale.
  • Figure 3: Architecture of the proposed spectral-aware global fusion (SGF) module. This module is composed of three sequential parts: spectral-aware feature enhancement, spectral-aware channel attention, and global cross-modal attention.
  • Figure 4: Qualitative comparisons on the MFNet (Ha et al., 2017) dataset. We show the segmentation results for four test instances as examples: two captured during daytime (top) and the other two at nighttime (bottom). For easier comparison, the red boxes highlight specific areas of interest. Our proposed SGFNet demonstrates superior segmentation accuracy under various illumination conditions.
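
The high-pass filtering mentioned in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `high_pass_filter`, the circular ideal mask, and the `cutoff_radius` parameter are assumptions chosen for illustration, and the caption does not state which filter shape or cutoff the paper uses.

```python
import numpy as np


def high_pass_filter(image, cutoff_radius):
    """Suppress low frequencies of a 2-D grayscale image via its Fourier spectrum.

    Hypothetical helper: an ideal circular mask is an assumption,
    not necessarily the filter used in the paper.
    """
    image = np.asarray(image, dtype=np.float64)
    height, width = image.shape

    # Move the zero-frequency (DC) term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every frequency bin from the spectrum centre.
    rows, cols = np.ogrid[:height, :width]
    distance = np.sqrt((rows - height // 2) ** 2 + (cols - width // 2) ** 2)

    # Keep only bins strictly outside the cutoff radius (the high frequencies).
    mask = distance > cutoff_radius
    filtered = spectrum * mask

    # Undo the shift and return to the spatial domain; the imaginary part
    # is only floating-point residue because the mask is symmetric.
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))


if __name__ == "__main__":
    # Toy image with a vertical edge: left half dark, right half bright.
    step = np.zeros((16, 16))
    step[:, 8:] = 1.0
    edges = high_pass_filter(step, cutoff_radius=2)
    print(edges.shape)
```

Applied separately to an RGB image (converted to grayscale) and a thermal image, a filter of this kind removes smooth regions and keeps edges and textures, which is the kind of modality-specific detail the caption refers to.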