Learning Deconvolution Network for Semantic Segmentation

Hyeonwoo Noh, Seunghoon Hong, Bohyung Han

TL;DR

This work addresses the limitations of fully convolutional network approaches in semantic segmentation, particularly with respect to scale variation and preserving fine object details. It introduces a deep deconvolution network built on a VGG-16 backbone, employing unpooling and learned deconvolution to produce dense, pixel-level class maps for object proposals, which are then aggregated to form a full-image segmentation. The authors demonstrate strong performance on the PASCAL VOC 2012 benchmark, achieving competitive results and, when ensembled with FCN methods, state-of-the-art performance among VOC-only models. The approach highlights the complementary strengths of proposal-based, detail-preserving segmentation and coarse, context-driven FCN methods, offering practical improvements for multi-scale segmentation tasks.

Abstract

We propose a novel semantic segmentation algorithm by learning a deconvolution network. We learn the network on top of the convolutional layers adopted from VGG 16-layer net. The deconvolution network is composed of deconvolution and unpooling layers, which identify pixel-wise class labels and predict segmentation masks. We apply the trained network to each proposal in an input image, and construct the final semantic segmentation map by combining the results from all proposals in a simple manner. The proposed algorithm mitigates the limitations of the existing methods based on fully convolutional networks by integrating deep deconvolution network and proposal-wise prediction; our segmentation method typically identifies detailed structures and handles objects in multiple scales naturally. Our network demonstrates outstanding performance in PASCAL VOC 2012 dataset, and we achieve the best accuracy (72.5%) among the methods trained with no external data through ensemble with the fully convolutional network.
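The abstract says only that the per-proposal results are combined "in a simple manner". As a minimal sketch of what such a combination could look like, the example below assumes a pixel-wise maximum over per-class score maps pasted back into image coordinates, followed by an argmax over classes. The function names `aggregate_max` and `label_map`, the `(top, left, score_map)` proposal layout, and the toy scores are all illustrative assumptions, not the authors' implementation.

```python
def aggregate_max(image_h, image_w, num_classes, proposals):
    """Combine proposal score maps by pixel-wise maximum (assumed rule).

    proposals: list of (top, left, scores), where scores[c][y][x] is the
    score of class c at pixel (y, x) inside the proposal window.
    Pixels covered by no proposal keep a score of 0.0 for every class.
    """
    agg = [[[0.0] * image_w for _ in range(image_h)] for _ in range(num_classes)]
    for top, left, scores in proposals:
        for c in range(num_classes):
            for y, row in enumerate(scores[c]):
                for x, s in enumerate(row):
                    yy, xx = top + y, left + x
                    # Ignore any part of a proposal that falls outside the image.
                    if 0 <= yy < image_h and 0 <= xx < image_w:
                        agg[c][yy][xx] = max(agg[c][yy][xx], s)
    return agg


def label_map(agg):
    """Pick the highest-scoring class at every pixel (ties go to the lower index)."""
    num_classes = len(agg)
    h, w = len(agg[0]), len(agg[0][0])
    return [
        [max(range(num_classes), key=lambda c: agg[c][y][x]) for x in range(w)]
        for y in range(h)
    ]


# Toy example: a 2x3 image, two classes (0 = background, 1 = object),
# and two overlapping 2x2 proposals with made-up scores.
proposal_a = (0, 0, [
    [[0.9, 0.2], [0.8, 0.3]],  # class 0
    [[0.1, 0.8], [0.2, 0.7]],  # class 1
])
proposal_b = (0, 1, [
    [[0.4, 0.6], [0.1, 0.9]],  # class 0
    [[0.6, 0.4], [0.9, 0.1]],  # class 1
])

agg = aggregate_max(2, 3, 2, [proposal_a, proposal_b])
labels = label_map(agg)
```

In the overlapping column, each class keeps the stronger of the two proposal scores, so a confident prediction from one proposal is not diluted by a weaker one from another. A pixel-wise average would be an equally simple alternative.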

Paper Structure

This paper contains 24 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Limitations of semantic segmentation algorithms based on fully convolutional network. (Left) original image. (Center) ground-truth annotation. (Right) segmentations by FCN.
  • Figure 2: Overall architecture of the proposed network. On top of the convolution network based on VGG 16-layer net, we put a multi-layer deconvolution network to generate the accurate segmentation map of an input proposal. Given a feature representation obtained from the convolution network, dense pixel-wise class prediction map is constructed through multiple series of unpooling, deconvolution and rectification operations.
  • Figure 3: Illustration of deconvolution and unpooling operations.
  • Figure 4: Visualization of activations in our deconvolution network. The activation maps from top left to bottom right correspond to the output maps from lower to higher layers in the deconvolution network. We select the most representative activation in each layer for effective visualization. The image in (a) is an input, and the rest are the outputs from (b) the last $14 \times 14$ deconvolutional layer, (c) the $28 \times 28$ unpooling layer, (d) the last $28 \times 28$ deconvolutional layer, (e) the $56 \times 56$ unpooling layer, (f) the last $56 \times 56$ deconvolutional layer, (g) the $112 \times 112$ unpooling layer, (h) the last $112 \times 112$ deconvolutional layer, (i) the $224 \times 224$ unpooling layer and (j) the last $224 \times 224$ deconvolutional layer. The finer details of the object are revealed, as the features are forward-propagated through the layers in the deconvolution network. Note that noisy activations from background are suppressed through propagation while the activations closely related to the target classes are amplified. It shows that the learned filters in higher deconvolutional layers tend to capture class-specific shape information.
  • Figure 5: Comparison of class conditional probability maps from FCN and our network (top: dog, bottom: bicycle).
  • ...and 2 more figures
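
The unpooling operation named in the captions of Figures 2 to 4 can be sketched in a few lines. The example below assumes the usual max-unpooling formulation: 2x2 max pooling records the location of each maximum (a "switch"), and unpooling writes each pooled activation back to its recorded location, leaving every other position at zero. The helper names `max_pool_2x2` and `unpool_2x2` and the toy feature map are hypothetical, and the learned deconvolution that would then densify the sparse output is omitted.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2D list with even height and width.

    Returns the pooled map and, for each pooled value, the (row, col)
    location it came from in the input (the "switch").
    """
    h, w = len(x), len(x[0])
    pooled, switches = [], []
    for i in range(0, h, 2):
        pooled_row, switch_row = [], []
        for j in range(0, w, 2):
            window = [
                (x[i + di][j + dj], (i + di, j + dj))
                for di in (0, 1)
                for dj in (0, 1)
            ]
            value, loc = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            switch_row.append(loc)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool_2x2(pooled, switches, out_h, out_w):
    """Place each pooled activation back at its recorded location; zeros elsewhere."""
    out = [[0.0] * out_w for _ in range(out_h)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out


# Toy 4x4 feature map (values chosen so that no window has a tie).
feature = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 1.0],
]

pooled, switches = max_pool_2x2(feature)
restored = unpool_2x2(pooled, switches, 4, 4)
```

The restored map has the original 4x4 size but is sparse: only the four maxima survive, each at its original position. This matches the size doubling at each unpooling step listed in the Figure 4 caption (14, 28, 56, 112, 224), with the deconvolutional layers in between responsible for filling in the detail.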