DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Zilu Guo; Liuyang Bian; Xuan Huang; Hu Wei; Jingyu Li; Huasheng Ni

TL;DR

The paper tackles the trade-off between segmentation accuracy and inference speed by rethinking how atrous convolutions are applied. It introduces DSNet, a dual-branch, same-resolution network that combines shallow atrous blocks (MFACB) with dense 3×3 convolutions, linked via a Multi-Scale Attention Fusion (MSAF) mechanism, with SPASPP placed outside the backbone to broaden context. Three empirical guidelines shape how atrous convolutions are used: do not rely solely on atrous convolutions, avoid excessively large atrous rates that hinder pretraining, and use effective fusion to balance details and context. DSNet achieves state-of-the-art speed–accuracy trade-offs on ADE20K ($40.0\%$ mIOU at $179.2$ FPS) and Cityscapes ($80.4\%$ mIOU at $81.9$ FPS), with DSNet-Base delivering even stronger non-real-time accuracy; code is available at the provided GitHub repository.

Abstract

Atrous convolutions are used to increase the receptive field in semantic segmentation. In previous work, however, they were rarely employed in the shallow layers of the model. We revisit the design of atrous convolutions in modern convolutional neural networks (CNNs) and demonstrate that the concept of using large kernels to apply atrous convolutions could be a more powerful paradigm. We propose three guidelines for applying atrous convolutions more efficiently. Following these guidelines, we propose DSNet, a Dual-Branch CNN architecture that incorporates atrous convolutions in its shallow layers and pretrains nearly the entire encoder on ImageNet to achieve better performance. Our models achieve a new state-of-the-art trade-off between accuracy and speed on the ADE20K, Cityscapes, and BDD datasets. Specifically, DSNet achieves 40.0% mIOU at an inference speed of 179.2 FPS on ADE20K, and 80.4% mIOU at 81.9 FPS on Cityscapes. Source code and models are available on GitHub: https://github.com/takaniwa/DSNet.
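
The link between atrous rates and large kernels can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses the standard formula for the span of a dilated kernel, and the function names and the example rates (2, 4, 8) are hypothetical choices, not values taken from the paper.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span, in pixels, covered by a kernel with `kernel_size` taps per axis
    when neighbouring taps are `rate` pixels apart (rate=1 is a dense kernel)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def stacked_receptive_field(layers) -> int:
    """Receptive field of a chain of stride-1 convolutions.

    `layers` is a sequence of (kernel_size, rate) pairs; each layer widens
    the receptive field by its effective kernel size minus one.
    """
    receptive_field = 1
    for kernel_size, rate in layers:
        receptive_field += effective_kernel_size(kernel_size, rate) - 1
    return receptive_field


# A 3x3 kernel with rate 4 spans the same 9x9 window as a dense 9x9 kernel,
# but still has only 9 weights.
print(effective_kernel_size(3, 4))                       # 9

# Three dense 3x3 layers versus three atrous 3x3 layers
# (rates 2, 4, 8 are hypothetical, chosen only for illustration).
print(stacked_receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(stacked_receptive_field([(3, 2), (3, 4), (3, 8)]))  # 29
```

Under these assumptions, the atrous stack covers a 29-pixel span with the same parameter count as the dense stack, which covers only 7 pixels. This is why a 3×3 atrous convolution can be viewed as a sparse large kernel.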

Paper Structure (15 sections, 5 equations, 4 figures, 10 tables)

Related Work
High-Precision Semantic Segmentation
Real-Time Semantic Segmentation
Method
Network design
DSNet: A novel Dual-Branch Network
MFACB: Learning of different scales.
MSAF: Balancing the Details and Contexts
SPASPP: Further extracting context information
Experiment
Dataset
Implementation Details
Ablation Study
Comparison
Conclusion

Figures (4)

Figure 1: Overview of DSNet. MFACB, MSAF, and SPASPP denote Multi-Scale Fusion Atrous Convolutional Block, Multi-Scale Attention Fusion Module, and Serial-Parallel Atrous Spatial Pyramid Pooling, respectively. UP indicates upsampling, and CAT indicates concatenation. C = 32.
Figure 2: Diagram of the Multi-Scale Fusion Atrous Convolutional Block (MFACB). C represents the number of channels, and r = a indicates an atrous rate of a.
Figure 3: Schematic diagram of MSA and MSAF. AvgPool(4) denotes global average pooling to $4\times4$; $\sigma$ represents the sigmoid function; Unpool represents average unpooling.
Figure 4: Illustration of SPASPP module.
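
The operations named in the Figure 3 caption (pooling to a $4\times4$ grid, unpooling, and a sigmoid) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a single-channel map whose height and width are divisible by 4, it assumes unpooling copies each pooled value back over its block, and the convex-combination form of `gated_fusion` is a hypothetical stand-in for how a sigmoid gate might balance a detail branch against a context branch. All function names are illustrative.

```python
import math


def avg_pool_to_grid(feature, grid=4):
    """Average-pool a 2-D map (list of rows) down to grid x grid cells.
    Assumes height and width are divisible by `grid`."""
    block_h = len(feature) // grid
    block_w = len(feature[0]) // grid
    pooled = []
    for gi in range(grid):
        row = []
        for gj in range(grid):
            block = [feature[i][j]
                     for i in range(gi * block_h, (gi + 1) * block_h)
                     for j in range(gj * block_w, (gj + 1) * block_w)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def unpool_to_size(pooled, height, width):
    """Copy each pooled value back over the block it summarised
    (one possible reading of 'average unpooling'; an assumption)."""
    grid = len(pooled)
    block_h, block_w = height // grid, width // grid
    return [[pooled[i // block_h][j // block_w] for j in range(width)]
            for i in range(height)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(detail, context, gate_logits):
    """Hypothetical fusion: g * detail + (1 - g) * context, g = sigmoid(logit)."""
    return [[sigmoid(g) * d + (1.0 - sigmoid(g)) * c
             for d, c, g in zip(d_row, c_row, g_row)]
            for d_row, c_row, g_row in zip(detail, context, gate_logits)]


# An 8x8 ramp pooled to 4x4 and expanded back to 8x8.
ramp = [[float(i * 8 + j) for j in range(8)] for i in range(8)]
pooled = avg_pool_to_grid(ramp)
restored = unpool_to_size(pooled, 8, 8)
print(pooled[0][0], restored[7][7])  # 4.5 58.5
```

A gate logit of 0 gives equal weight to both branches, so `gated_fusion([[2.0]], [[4.0]], [[0.0]])` returns `[[3.0]]`.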
