Multi-Scale Semantic Segmentation with Modified MBConv Blocks

Xi Chen, Yang Cai, Yuan Wu, Bo Xiong, Taesung Park

TL;DR

This work adapts MBConv blocks, originally designed for classification, to semantic segmentation so that fine spatial details are preserved. It introduces a multi-scale segmentation framework in which every U-Net branch has the same depth and capacity, with a stem module that downscales inputs to control memory. The two core modifications are replacing the 1×1 convolutions in MBConv blocks with 3×3 convolutions to strengthen spatial context, and giving high-resolution branches learning power comparable to low-resolution ones. Experiments on Mapillary Vistas and Cityscapes, including ablations of each component and of their combination, support the approach, with a state-of-the-art mean IoU of $84.58\%$ on Cityscapes.

Abstract

MBConv blocks, initially designed for efficiency in resource-limited settings and later adapted for state-of-the-art image classification, have shown significant potential in classification tasks. Their application to semantic segmentation, however, remains relatively unexplored. This paper introduces an adaptation of MBConv blocks tailored to semantic segmentation. The modification stems from the insight that semantic segmentation requires more detailed spatial information than image classification. We also argue that, for effective multi-scale semantic segmentation, each branch of a U-Net architecture should have equivalent segmentation capability regardless of its resolution. With these changes, our approach achieves mean Intersection over Union (IoU) scores of 84.5% and 84.0% on the Cityscapes test and validation sets, respectively.
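
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of mean IoU computed from a confusion matrix. It assumes the common definition (per-class TP / (TP + FP + FN), averaged over the classes that occur); the function name `mean_iou` and the toy two-class matrix are illustrative and do not come from the paper.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[t][p] counts pixels whose true class is t and predicted class is p.
    Classes with no true or predicted pixels are skipped (an assumption of this sketch).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true c, predicted as another class
        fp = sum(confusion[r][c] for r in range(n)) - tp  # another class, predicted as c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)


# Toy example with two classes (hypothetical counts).
toy = [[8, 2],
       [1, 9]]
score = mean_iou(toy)  # average of 8/11 and 9/12
```

Because each class contributes equally to the average, small or rare classes weigh as much as large ones, which is one reason fine spatial detail matters for this metric.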

Paper Structure

This paper contains 16 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: The higher-resolution feature maps show that these branches can segment the smaller objects in the images (the context can affect the final class). The higher-resolution branches therefore need the same learning power as the lower-resolution ones, since they must classify and segment a similar number of classes and objects.
  • Figure 2: The proposed modifications. Left: Our modified U-Net. All the branches have the same depth and number of channels. The residual blocks are replaced with our modified MBConv blocks. Right: Our modified MBConv block. The $1 \times 1$ convolutions are replaced with $3 \times 3$ convolutions.
  • Figure 3: Sample qualitative results from the Cityscapes (Cordts et al., 2016) validation set. From left to right: input image, ground truth, prediction, prediction overlaid on the input image, and the segmentation error.
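
The block change described in the Figure 2 caption has a direct cost in parameters, which the rough calculation below tries to make concrete. It is a sketch under stated assumptions, not the authors' configuration: it assumes a MobileNetV2-style MBConv layout (expansion convolution, 3×3 depthwise convolution, projection convolution), that both pointwise convolutions become full 3×3 convolutions, and it ignores biases, normalization, and any squeeze-and-excitation module. The channel width 64 and expansion ratio 4 are hypothetical values chosen for illustration.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a bias-free k x k convolution with the given group count."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


def mbconv_params(channels, expansion, pointwise_kernel):
    """Weights in an assumed MBConv layout: expand -> 3x3 depthwise -> project.

    pointwise_kernel=1 gives the standard block; pointwise_kernel=3 gives the
    variant in which the 1x1 convolutions are replaced by 3x3 convolutions.
    """
    hidden = channels * expansion
    expand = conv_params(channels, hidden, pointwise_kernel)
    depthwise = conv_params(hidden, hidden, 3, groups=hidden)
    project = conv_params(hidden, channels, pointwise_kernel)
    return expand + depthwise + project


# Hypothetical width and expansion ratio, not taken from the paper.
standard = mbconv_params(channels=64, expansion=4, pointwise_kernel=1)
modified = mbconv_params(channels=64, expansion=4, pointwise_kernel=3)
ratio = modified / standard
```

Under these assumptions the two pointwise layers dominate the weight count, so enlarging their kernels to 3×3 multiplies the block size several times over. This is consistent with the summary's mention of a stem module that downscales inputs to control memory.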