DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Zilu Guo, Liuyang Bian, Xuan Huang, Hu Wei, Jingyu Li, Huasheng Ni

TL;DR

The paper tackles the challenge of balancing segmentation accuracy and inference speed by rethinking how atrous convolutions are applied. It introduces DSNet, a Dual-Branch, same-resolution network that combines shallow atrous blocks (MFACB) with dense 3×3 convolutions, links the two branches via a Multi-Scale Attention Fusion (MSAF) mechanism, and places SPASPP outside the backbone to broaden context. Three empirical guidelines govern atrous usage: avoid relying solely on atrous convolutions, avoid excessively large atrous rates that hinder pretraining, and employ effective fusion to balance details and context. DSNet achieves state-of-the-art speed–accuracy trade-offs on ADE20K ($40.0\%$ mIOU at $179.2$ FPS) and Cityscapes ($80.4\%$ mIOU at $81.9$ FPS), with DSNet-Base delivering even stronger non-real-time accuracy; code is available at the provided GitHub repository.

Abstract

Atrous convolutions are employed as a method to increase the receptive field in semantic segmentation tasks. However, in previous work on semantic segmentation, they were rarely employed in the shallow layers of the model. We revisit the design of atrous convolutions in modern convolutional neural networks (CNNs), and demonstrate that the concept of using large kernels to apply atrous convolutions could be a more powerful paradigm. We propose three guidelines to apply atrous convolutions more efficiently. Following these guidelines, we propose DSNet, a Dual-Branch CNN architecture, which incorporates atrous convolutions in the shallow layers of the model architecture and pretrains nearly the entire encoder on ImageNet to achieve better performance. Demonstrating the effectiveness of our approach, our models achieve a new state-of-the-art trade-off between accuracy and speed on the ADE20K, Cityscapes and BDD datasets. Specifically, DSNet achieves 40.0% mIOU with an inference speed of 179.2 FPS on ADE20K, and 80.4% mIOU with a speed of 81.9 FPS on Cityscapes. Source code and models are available on GitHub: https://github.com/takaniwa/DSNet.
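To make the link between atrous rates and receptive field concrete, the following is a minimal sketch of the standard arithmetic: a kernel of size k with atrous rate r spans k + (k - 1)(r - 1) pixels along one axis, so a 3×3 atrous convolution behaves like a sparse large kernel. The function names and the example rates (1, 2, 4) are illustrative assumptions and are not taken from the DSNet code or paper.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Span, in pixels along one axis, covered by a single atrous convolution."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def stacked_receptive_field(layers) -> int:
    """Receptive field of stride-1 convolutions applied one after another.

    `layers` is a list of (kernel_size, rate) pairs; the values used below
    are hypothetical and chosen only for illustration.
    """
    field = 1
    for kernel_size, rate in layers:
        # Each stride-1 layer widens the field by (effective size - 1).
        field += effective_kernel_size(kernel_size, rate) - 1
    return field


if __name__ == "__main__":
    for rate in (1, 2, 4):
        print(f"3x3 kernel, rate {rate}: spans {effective_kernel_size(3, rate)} pixels")
    print("stacked rates 1, 2, 4:", stacked_receptive_field([(3, 1), (3, 2), (3, 4)]))
```

With these assumed rates, a 3×3 kernel spans 3, 5, and 9 pixels respectively, while still using only nine weights per channel pair. This is why placing atrous convolutions in shallow layers can enlarge context cheaply.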

Paper Structure

This paper contains 15 sections, 5 equations, 4 figures, and 10 tables.

Figures (4)

  • Figure 1: Overview of DSNet. MFACB, MSAF, and SPASPP denote Multi-scale Fusion Atrous Convolutional Block, Multi-Scale Attention Fusion Module, and Serial-Parallel Atrous Spatial Pyramid Pooling, respectively. UP indicates upsampling, and CAT indicates concatenation. C = 32.
  • Figure 2: Diagram of the Multi-Scale Fusion Atrous Convolutional Block (MFACB). C represents the number of channels, and r = a indicates an atrous rate of a.
  • Figure 3: MSA and MSAF schematic diagram. AvgPool(4) denotes global average pooling to $4\times4$, $\sigma$ represents the sigmoid function, and Unpool represents average unpooling.
  • Figure 4: Illustration of SPASPP module.
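The pooling and unpooling operations named in the Figure 3 caption can be illustrated with a small pure-Python sketch. It assumes that AvgPool(4) averages the feature map over a 4×4 grid of equal blocks and that average unpooling copies each pooled value back over its block. Both are interpretations of the caption, not the authors' implementation, and the function names are hypothetical.

```python
def avg_pool_to_grid(feature, grid=4):
    """Average a 2D feature map (list of rows) down to a grid x grid map.

    Assumption of this sketch: height and width are divisible by `grid`.
    """
    height, width = len(feature), len(feature[0])
    if height % grid or width % grid:
        raise ValueError("this sketch assumes sizes divisible by the grid")
    block_h, block_w = height // grid, width // grid
    pooled = []
    for gy in range(grid):
        row = []
        for gx in range(grid):
            block = [
                feature[y][x]
                for y in range(gy * block_h, (gy + 1) * block_h)
                for x in range(gx * block_w, (gx + 1) * block_w)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def unpool(pooled, height, width):
    """Copy each pooled value back over the block it was averaged from."""
    grid = len(pooled)
    block_h, block_w = height // grid, width // grid
    return [
        [pooled[y // block_h][x // block_w] for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # Toy 8x8 map whose entries are 0..63 in row-major order.
    feature = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
    pooled = avg_pool_to_grid(feature)
    restored = unpool(pooled, 8, 8)
    print(pooled[0])
    print(restored[0])
```

Under these assumptions the restored map has the input's resolution but only 16 distinct values, which is what allows a coarse attention signal to be applied at full resolution.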