Context-Aware Interaction Network for RGB-T Semantic Segmentation

Ying Lv, Zhi Liu, Gongyang Li

TL;DR

The paper tackles RGB-T semantic segmentation by introducing CAINet, a fusion framework that constructs a context-aware interaction space to explicitly exploit cross-modal complementarity across multiple feature levels. It integrates Context-Aware Complementary Reasoning (CACR), Global Context Modeling (GCM), and Detail Aggregation (DA) with residual learning and multi-level auxiliary supervision to guide learning and refine segmentation maps. Empirical results on MFNet and PST900 show state-of-the-art performance with strong cross-modal robustness and an efficient 12.16M-parameter design, and generalization to RGB-D data further supports broad applicability. The approach advances multimodal fusion by combining the benefits of direct and feedback fusion and by using global and boundary information for precise, context-rich segmentation relevant to autonomous driving and related tasks.

Abstract

RGB-T semantic segmentation is a key technique for scene understanding in autonomous driving. Existing RGB-T semantic segmentation methods, however, do not effectively explore the complementary relationship between modalities when exchanging information across multiple feature levels. To address this issue, we propose the Context-Aware Interaction Network (CAINet) for RGB-T semantic segmentation, which constructs an interaction space that exploits auxiliary tasks and global context for explicitly guided learning. Specifically, we propose a Context-Aware Complementary Reasoning (CACR) module that establishes the complementary relationship between multimodal features with long-term context in both the spatial and channel dimensions. Further, considering the importance of global contextual and detailed information, we propose the Global Context Modeling (GCM) module and the Detail Aggregation (DA) module, and we introduce specific auxiliary supervision to explicitly guide the context interaction and refine the segmentation map. Extensive experiments on two benchmark datasets, MFNet and PST900, demonstrate that the proposed CAINet achieves state-of-the-art performance. The code is available at https://github.com/YingLv1106/CAINet.

Paper Structure (16 sections, 15 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Samples of RGB-T semantic segmentation from the MFNet [2017MFNet] dataset, which contains daytime, bright-light, and nighttime images (from top to bottom).
  • Figure 2: Three typical categories of architectures in RGB-T semantic segmentation, including a) direct fusion [2017MFNet, 2019RTFNet, 2020PSTNet, 2021MLFNet, 2021FuseSeg, 2021ABMDRNet, 2022EGFNet, 2021FEANet, 2022MTANet, 2021GMNet, zhou2023embedded, 2023LASNet, 2022MFFENet], b) feedback fusion [2022CCFFNet, liu2022cmx, zhou2022multispectral], and c) our proposed context-aware interaction fusion network.
  • Figure 3: Overview of the proposed CAINet. CAINet consists of six components: RGB and thermal encoders, interaction space reasoning, global context modeling, three-decoder supervision, detailed feature fusion, and residual learning with supervision from multiple auxiliary tasks. The residual learning (ARLM) module [zhou2023embedded] assists context-aware complementary reasoning (CACR) and global context modeling (GCM) in implementing multimodal feature interaction; the detail aggregation (DA) module refines the final segmentation map. During inference, the other supervised branches can be removed and only the final predicted semantic segmentation map $P_4$ retained, so the performance enhancement comes with no added inference cost.
  • Figure 4: Illustration of a) the Context-Aware Complementary Reasoning (CACR) module, b) the Global Context Modeling (GCM) module, where $\delta$, $\gamma$, and $origin$ represent feature maps from different convolution layers, and c) the Detail Aggregation (DA) module.
  • Figure 5: Visual comparisons of the proposed method and seven state-of-the-art methods in typical daytime and nighttime images of MFNet. The proposed CAINet provides suitable results under a variety of lighting conditions.
  • ...and 2 more figures
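
The Figure 3 caption notes that the auxiliary supervised branches are used only during training and that inference keeps just the final map $P_4$. The following is a minimal sketch of that general pattern in plain Python, not the authors' implementation. Several things here are assumptions made for illustration: the head names `P1` to `P3`, the `aux_weight` value, the use of cross-entropy for every head, and the choice to supervise every head with the same label map (the paper's auxiliary tasks may use different targets and losses). Only the name `P4` comes from the source.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred, target):
    """Mean per-pixel cross-entropy.

    pred:   list of per-pixel logit lists, one logit per class
    target: list of ground-truth class indices, one per pixel
    """
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(pred, target)]
    return sum(losses) / len(losses)


def training_loss(predictions, target, aux_weight=0.5):
    """Main loss on P4 plus a weighted sum of auxiliary-head losses.

    predictions: dict mapping head name to per-pixel logits.
    aux_weight:  hypothetical weighting, not taken from the paper.
    """
    main = cross_entropy(predictions["P4"], target)
    aux = sum(cross_entropy(p, target)
              for name, p in predictions.items() if name != "P4")
    return main + aux_weight * aux


def infer(predictions):
    """Inference reads only P4; auxiliary heads are never touched."""
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in predictions["P4"]]


# Toy example: 2 pixels, 2 classes, one auxiliary head (hypothetical name P1).
toy = {
    "P1": [[0.0, 0.0], [0.0, 0.0]],
    "P4": [[2.0, 0.0], [0.0, 3.0]],
}
labels = [0, 1]
loss = training_loss(toy, labels)
segmentation = infer({"P4": toy["P4"]})  # works without any auxiliary head
```

Because `infer` depends only on the `P4` entry, a deployed model could skip computing the auxiliary heads entirely, which is the sense in which the extra supervision adds no inference cost.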