SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation

Zhengze Xu; Dongyue Wu; Changqian Yu; Xiangxiang Chu; Nong Sang; Changxin Gao

TL;DR

SCTNet tackles the challenge of real-time semantic segmentation by disentangling semantic richness from inference-time cost. It trains a CNN backbone to learn long-range context guided by a training-time transformer, using the Conv-Former Block and Semantic Information Alignment Module to align semantic representations without adding inference overhead. The approach yields state-of-the-art speed-accuracy trade-offs on Cityscapes, ADE20K, and COCO-Stuff-10K, demonstrating that transformer-level semantics can be infused into a single-branch CNN during training. This offers a practical path to high-accuracy, fast segmentation suitable for real-time applications and informs future designs that leverage training-time heterogeneity to boost single-branch networks.

Abstract

Recent real-time semantic segmentation methods usually adopt an additional semantic branch to capture rich long-range context. However, the additional branch incurs undesirable computational overhead and slows inference. To resolve this dilemma, we propose SCTNet, a single-branch CNN with transformer semantic information for real-time segmentation. SCTNet enjoys the rich semantic representations of an inference-free semantic branch while retaining the high efficiency of a lightweight single-branch CNN. SCTNet uses a transformer as the training-only semantic branch, given its strong ability to extract long-range context. With the help of the proposed transformer-like CNN block, CFBlock, and the semantic information alignment module, SCTNet can capture the rich semantic information from the transformer branch during training. During inference, only the single-branch CNN needs to be deployed. We conduct extensive experiments on Cityscapes, ADE20K, and COCO-Stuff-10K, and the results show that our method achieves new state-of-the-art performance. The code and model are available at https://github.com/xzz777/SCTNet
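
The training-only semantic branch described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`mse_alignment`, `training_loss`), the `align_weight` parameter, and the choice of a mean-squared-error alignment term are all assumptions made for illustration, and the paper's actual alignment loss may differ. The sketch only shows the general idea that transformer features act as a target during training and are absent at inference.

```python
from typing import List, Optional

Vector = List[float]


def mse_alignment(student: Vector, teacher: Vector) -> float:
    """Mean squared difference between two equally sized feature vectors.

    Stand-in for an alignment loss (an assumption, not the paper's loss).
    """
    if len(student) != len(teacher):
        raise ValueError("feature vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def training_loss(
    task_loss: float,
    cnn_features: Vector,
    transformer_features: Optional[Vector],
    align_weight: float = 1.0,
) -> float:
    """Combine the segmentation loss with an optional alignment term.

    When transformer features are supplied (training), the CNN features are
    pulled toward them. When they are None (no transformer branch, as at
    inference), only the task loss of the single-branch CNN is returned.
    """
    if transformer_features is None:
        return task_loss
    return task_loss + align_weight * mse_alignment(cnn_features, transformer_features)


# Hypothetical usage: with the training-only branch, then without it.
with_branch = training_loss(0.5, [1.0, 2.0], [1.0, 4.0])
without_branch = training_loss(0.5, [1.0, 2.0], None)
```

Because the alignment term only appears when transformer features are passed in, dropping the transformer branch leaves the CNN path unchanged, which matches the claim that the semantic branch adds no inference cost.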

Paper Structure (33 sections, 5 equations, 10 figures, 12 tables)

Introduction
Related Work
Methodology
Motivation
Conv-Former Block
Semantic Information Alignment Module
Overall Architecture
Alignment Loss
Experiments
Datasets and Implementation Details
Comparison with State-of-the-art Methods
Ablation Study
Visualization Results
Conclusion
Acknowledgments
...and 18 more sections

Figures (10)

Figure 1: The speed-accuracy performance on Cityscapes validation set. Our methods are presented in red stars, while others are presented in blue dots. Our SCTNet establishes a new state-of-the-art speed-accuracy trade-off.
Figure 2: Real-time semantic segmentation paradigms. (a) Decoupled bilateral network divides a semantic branch and a spatial branch at the early stage. (b) Feature sharing bilateral network separates the two branches at a later stage and adopts dense fusion modules. (c) Our SCTNet applies a single hierarchy branch with a semantic extraction transformer, free from the extra branch and costly fusion module in inference. FM: Fusion Module, SIAM: Semantic Information Alignment Module. Dashed arrows and boxes denote training-only components.
Figure 3: The architecture of SCTNet. CFBlock (Conv-Former Block, detailed in Figure 4) takes advantage of the training-only Transformer branch (greyed-out in the dashed box) via SIAM (Semantic Information Alignment Module), which is composed of BFA (Backbone Feature Alignment) and SDHA (Shared Decoder Head Alignment).
Figure 4: Design of Conv-Former Block (left) and the details of convolutional attention (right). GDN means Grouped Double Normalization. $\otimes$ means convolution operations, $\oplus$ stands for addition, and $k$ means the kernel size.
Figure 5: Visualization results on Cityscapes validation set. Compared with DDRNet-23 (Pan et al., 2022) and RTFormer-B (Wang et al., 2022), SCTNet-B generates masks with finer details, as highlighted in the light blue box, and more accurate large-area predictions, as highlighted in the yellow box.
...and 5 more figures
