Efficient Semantic Segmentation via Lightweight Multiple-Information Interaction Network

Yangyang Qiu, Guoan Xu, Guangwei Gao, Zhenhua Guo, Yi Yu, Chia-Wen Lin

TL;DR

The paper tackles real-time semantic segmentation by fusing CNNs and Transformers through lightweight multi-information interactions. It introduces LMIINet, featuring Lightweight Feature Interaction Bottlenecks (LFIB), an improved Flatten Transformer with Focused Linear Attention Module (FLAM) and Channel Attention Block (CAB), and a Combination Coefficient learning scheme to enhance cross-branch feature interaction. With an encoder–decoder design and long connection modules, LMIINet achieves high efficiency, recording 72.0% mIoU on Cityscapes at 100 FPS and 69.94% on CamVid at 160 FPS on a single RTX2080Ti, using only 0.72M parameters. The results demonstrate a favorable accuracy–speed balance, validating the proposed lightweight interaction framework for practical real-time segmentation tasks.

Abstract

Recently, integrating the local modeling capabilities of Convolutional Neural Networks (CNNs) with the global dependency strengths of Transformers has attracted considerable attention in the semantic segmentation community. However, substantial computational workloads and high hardware memory demands remain major obstacles to their further application in real-time scenarios. In this work, we propose a Lightweight Multiple-Information Interaction Network (LMIINet) for real-time semantic segmentation, which effectively combines CNNs and Transformers while reducing redundant computations and memory footprints. It features Lightweight Feature Interaction Bottleneck (LFIB) modules comprising efficient convolutions that enhance context integration. Additionally, improvements are made to the Flatten Transformer by enhancing local and global feature interaction to capture detailed semantic information. Incorporating a combination coefficient learning scheme in both LFIB and Transformer blocks facilitates improved feature interaction. Extensive experiments demonstrate that LMIINet excels in balancing accuracy and efficiency. With only 0.72M parameters and 11.74G FLOPs (floating-point operations), LMIINet achieves 72.0% mIoU (mean Intersection over Union) at 100 FPS (Frames Per Second) on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid test set using a single RTX2080Ti GPU.
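The accuracy figures above are reported as mIoU. As a minimal sketch of how that metric is usually computed (not the authors' evaluation code), the snippet below accumulates a confusion matrix over flattened label lists and averages the per-class IoU = TP / (TP + FP + FN). The function names, the toy labels, and the ignore value 255 are assumptions made for illustration.

```python
def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Build a num_classes x num_classes count matrix.

    Rows index the ground-truth class, columns the predicted class.
    Pixels whose ground truth equals ignore_index are skipped
    (255 is an assumed "void" label, chosen here for illustration).
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix):
    """Average per-class IoU, skipping classes absent from both pred and target."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                          # class c pixels predicted as another class
        fp = sum(matrix[r][c] for r in range(n)) - tp     # other pixels predicted as class c
        denom = tp + fp + fn
        if denom == 0:
            continue
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 8 "pixels", 3 classes, one ignored pixel.
pred = [0, 0, 1, 1, 2, 2, 2, 0]
target = [0, 1, 1, 1, 2, 2, 255, 0]
score = mean_iou(confusion_matrix(pred, target, num_classes=3))
print(round(score, 4))  # 0.7778 (per-class IoUs are 2/3, 2/3 and 1)
```

In practice the confusion matrix is accumulated over every image in the test set before the per-class ratios are taken, so that rare classes are not averaged image by image.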

Paper Structure (15 sections, 16 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Accuracy-Parameters-Speed evaluations on the Cityscapes test dataset under the same device.
  • Figure 2: The complete architecture of the proposed Lightweight Multiple-Information Interaction Network (LMIINet). It consists of three parts: the decoding stage, the encoding stage, and the improved Flatten Transformer.
  • Figure 3: The diagram of the proposed Lightweight Feature Interaction Bottleneck (LFIB), improved Flatten Transformer, Segmentation Head (SegHead), and Channel Attention Block (CAB). $D$ represents the depth-wise convolution, $R$ is the kernel of dilated convolution, and $CS$ denotes the channel shuffle operation.
  • Figure 4: The diagram of the Combination Coefficient learning (CC) scheme, Channel Recurrent Unit (CRU), and the Feature Enhancement (FE) module.
  • Figure 5: Visual comparisons on the Cityscapes dataset. From top to bottom are original input images, ground truths, and segmentation results from our LMIINet, SGCPNet [hao2022real], LEDNet [wang2019lednet], ESPNet [mehta2018espnet], and DABNet [li2019dabnet].
  • ...and 1 more figure
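The Figure 3 caption names three generic building blocks: depth-wise convolution, dilated convolution, and channel shuffle. The sketch below illustrates these standard operations in plain Python to show why they suit a lightweight network. It is not the LFIB design itself: the helper names are hypothetical, bias terms are ignored, and the depth-wise separable layout (depth-wise followed by 1x1 point-wise convolution) is an assumption for the parameter comparison.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups.

    Equivalent to viewing the channel list as (groups, per_group),
    transposing to (per_group, groups), and flattening.
    """
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depth-wise conv plus a 1x1 point-wise conv (bias ignored)."""
    return k * k * c_in + c_in * c_out


def dilated_kernel_extent(k, dilation):
    """Spatial extent covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


print(channel_shuffle(list(range(6)), groups=2))   # [0, 3, 1, 4, 2, 5]
print(standard_conv_params(64, 64, 3))             # 36864
print(depthwise_separable_params(64, 64, 3))       # 4672
print(dilated_kernel_extent(3, dilation=4))        # 9
```

For 64 input and 64 output channels, the separable form needs roughly an eighth of the weights of a standard 3x3 convolution, while dilation enlarges the covered area without adding any weights. Channel shuffle costs no parameters and lets information flow between channel groups that were processed separately.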