BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation

Hanshuo Qiu, Jie Jiang, Ruoli Yang, Lixin Zhan, Jizhao Liu

TL;DR

BIMII-Net addresses RGB-T road scene semantic segmentation under challenging illumination by integrating a brain-inspired deep continuous-coupled neural network (DCCNN) with a cross explicit attention-enhanced fusion module (CEAEF) and a complementary interactive multi-layer decoder. The architecture comprises a SegFormer-based encoder with CCNN layers, a dual-branch fusion module, and a three-branch decoder (SFI, DFI, MFE) under a multi-module supervision regime, enabling fine-grained texture capture and global skeleton reasoning. Ablation and comparative experiments on MFNet and PST900 demonstrate strong performance gains, particularly in boundary delineation and small-object segmentation, with robust day/night generalization. The work highlights the viability of brain-inspired computing for multi-modal semantic segmentation and provides a foundation for more efficient, scalable RGB-T models in real-world perception tasks.

Abstract

RGB-T road scene semantic segmentation fuses information from RGB and thermal images to improve visual scene understanding in complex environments with inadequate illumination or occlusion. However, existing RGB-T semantic segmentation models typically rely on simple addition or concatenation strategies, or ignore the differences between information at different levels. To address these issues, we propose a novel RGB-T road scene semantic segmentation network called Brain-Inspired Multi-Iteration Interaction Network (BIMII-Net). First, to meet the requirements of accurate texture and local information extraction in road scenarios such as autonomous driving, we propose a deep continuous-coupled neural network (DCCNN) architecture based on a brain-inspired model. Second, to enhance the interaction and expression capabilities among multi-modal information, we design a cross explicit attention-enhanced fusion module (CEAEF-Module) in the feature fusion stage of BIMII-Net to effectively integrate features at different levels. Finally, we construct a complementary interactive multi-layer decoder structure, incorporating the shallow-level feature iteration module (SFI-Module), the deep-level feature iteration module (DFI-Module), and the multi-feature enhancement module (MFE-Module) to collaboratively extract texture details and global skeleton information, with multi-module joint supervision further optimizing the segmentation results. Experimental results demonstrate that BIMII-Net achieves state-of-the-art (SOTA) performance in the brain-inspired computing domain and outperforms most existing RGB-T semantic segmentation methods. It also exhibits strong generalization capabilities on multiple RGB-T datasets, demonstrating the effectiveness of brain-inspired computing models in multi-modal image segmentation tasks.
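
The abstract does not spell out how the multi-module joint supervision is computed. One common reading, offered here only as a minimal sketch, is deep supervision: each decoder module produces its own segmentation logits, and the training loss is a weighted sum of per-module cross-entropy terms. The head names (`sfi`, `dfi`, `mfe`), the equal default weights, and the helper functions below are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean pixel-wise cross-entropy.

    logits: float array of shape (C, H, W); labels: int array of shape (H, W).
    """
    # Numerically stable log-softmax over the class axis.
    z = logits - logits.max(axis=0, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    rows, cols = np.indices(labels.shape)
    # Pick the log-probability of the true class at every pixel.
    return float(-log_probs[labels, rows, cols].mean())

def joint_loss(head_logits, labels, weights=None):
    """Weighted sum of per-head losses (hypothetical joint supervision).

    head_logits: dict mapping a head name to its (C, H, W) logits.
    weights: optional dict of per-head weights; defaults to 1.0 each (assumed).
    """
    weights = weights or {name: 1.0 for name in head_logits}
    return sum(weights[name] * cross_entropy(logits, labels)
               for name, logits in head_logits.items())

# Usage: three hypothetical heads predicting 3 classes on a 2x2 label map.
labels = np.array([[0, 1], [2, 1]])
heads = {name: np.zeros((3, 2, 2)) for name in ("sfi", "dfi", "mfe")}
total = joint_loss(heads, labels)
```

With all-zero logits every head predicts a uniform distribution, so each term equals the log of the class count; this makes the sketch easy to sanity-check before swapping in real predictions.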

Paper Structure

This paper contains 29 sections, 15 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Three fusion paradigms for RGB-T semantic segmentation: (a) encoder fusion, (b) decoder fusion, (c) feature fusion.
  • Figure 2: Overall architecture of the proposed BIMII-Net. BIMII-Net consists of an encoder, a feature fusion part, and a decoder. The encoder is based on the SegFormer structure. A CCNN layer is added after each SegFormer layer and fused with its output through a residual connection to further iterate features. This part extracts multi-scale features from RGB and thermal images. The feature fusion part fuses the features from the encoder through the CEAEF-Module; the initial inputs to the decoder are generated in this part. The decoder further separates shallow-level features and deep-level features, extracting texture and skeleton information through the SFI-Module and DFI-Module, respectively. Subsequently, the MFE-Module fuses them to generate the final semantic segmentation result. Furthermore, the overall architecture is optimized through a multi-module joint supervision strategy.
  • Figure 3: The overview of the proposed DCCNN architecture. The DCCNN architecture consists of 7 CCNN layers, with the encoder comprising 4 CCNN layers and the decoder including 3 CCNN layers. Signal transmission between layers is facilitated through state parameters. Each CCNN layer processes information across multiple time steps, and its output is obtained by averaging the results of these time steps. Additionally, during the feature fusion stage, the signals from the final encoder layer are averaged before being passed on to the decoder.
  • Figure 4: Proposed CEAEF-Module. The CEAEF-Module consists of two branches, processing RGB and thermal image features, respectively. Channel attention enhances feature expression capabilities and aligns the feature distributions. The module employs an explicit cross-modal deep interaction approach to improve feature representation and enhance complementarity across modalities. The features are ultimately integrated through a combination of depthwise separable convolution and spatial attention.
  • Figure 5: Overall architecture of the decoder modules: (a) The shallow-level feature iteration module (SFI-Module) is mainly used to extract and enhance the texture information in the shallow-level features. (b) The multi-feature enhancement module (MFE-Module) achieves the dynamic integration of shallow-level and deep-level features. (c) The deep-level feature iteration module (DFI-Module) processes deep-level features and extracts global semantic information.
  • ...and 4 more figures
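
The captions for Figures 2 and 3 describe three mechanics of a CCNN layer: it iterates over several time steps, its output is the average over those steps, and it is fused with the preceding SegFormer output through a residual connection, with state passed between layers. The sketch below illustrates those mechanics on a single 2-D feature map using a simplified continuous-coupled neuron (PCNN-style feeding, linking, and dynamic-threshold terms with a sigmoid output). The update equations, all parameter values, and the function names are assumptions chosen for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def neighbor_mean(y):
    """Mean of the 8 neighbours of each pixel, with zero padding at the border."""
    h, w = y.shape
    p = np.pad(y, 1)
    total = sum(p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0))
    return total / 8.0

def ccnn_layer(s, state=None, steps=4, alpha_f=0.1, alpha_l=1.0,
               alpha_e=1.0, v_f=0.5, v_l=0.2, v_e=1.0, beta=0.5):
    """Run a simplified CCNN neuron grid for `steps` time steps.

    s: 2-D stimulus (feature map). state: optional dict carrying F, L, E, Y
    from a previous layer (assumed to have the same shape as `s`).
    Returns the time-averaged output and the final state.
    """
    s = np.asarray(s, dtype=float)
    if state is None:
        state = {k: np.zeros_like(s) for k in ("F", "L", "E", "Y")}
    F, L, E, Y = state["F"], state["L"], state["E"], state["Y"]
    outputs = []
    for _ in range(steps):
        coupling = neighbor_mean(Y)                     # input from neighbouring neurons
        F = np.exp(-alpha_f) * F + v_f * coupling + s   # feeding input
        L = np.exp(-alpha_l) * L + v_l * coupling       # linking input
        U = F * (1.0 + beta * L)                        # internal activity
        Y = 1.0 / (1.0 + np.exp(-(U - E)))              # continuous (sigmoid) output
        E = np.exp(-alpha_e) * E + v_e * Y              # dynamic threshold
        outputs.append(Y)
    # The layer output is the average over all time steps.
    return np.mean(outputs, axis=0), {"F": F, "L": L, "E": E, "Y": Y}

def residual_ccnn(feature, state=None, **kwargs):
    """Fuse the CCNN output with its input through a residual connection."""
    out, state = ccnn_layer(feature, state, **kwargs)
    return feature + out, state

# Usage: iterate one feature map, then hand the state to a second layer.
x = np.random.default_rng(0).random((4, 5))
y1, state1 = residual_ccnn(x, steps=3)
y2, state2 = residual_ccnn(y1, state=state1, steps=3)
```

In the real network the feature maps change resolution between stages, so passing state across layers would need resampling or a learned mapping; the sketch keeps a fixed shape to stay self-contained.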