Feedforward semantic segmentation with zoom-out features

Mohammadreza Mostajabi, Payman Yadollahpour, Gregory Shakhnarovich

TL;DR

Addresses semantic segmentation by reframing it as a per-superpixel classification problem using multi-level zoom-out features. Combines local, proximal, distant, and global context extracted via handcrafted features and pre-trained CNNs, fed to a feed-forward classifier with asymmetric loss to handle class imbalance. Achieves state-of-the-art performance on the VOC 2012 test set with mean IoU of 64.4%, demonstrating effective context modeling across multiple spatial scales without explicit structured prediction. Suggests that deep representations can be leveraged in a purely feed-forward framework, while leaving room for end-to-end training and selective integration with inference-based approaches.
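
To make the idea of an asymmetric loss for class imbalance concrete, here is a minimal sketch in Python. It assumes one common form: a log loss in which each example is weighted by the inverse frequency of its true class. The summary above does not state the exact weighting, so the function names (`inverse_frequency_weights`, `weighted_log_loss`) and the weighting scheme are illustrative assumptions.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights equal to 1 / class frequency (an assumed weighting)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    freqs = counts / counts.sum()
    # Classes that never occur get weight 0, which avoids division by zero.
    weights = np.zeros(num_classes)
    present = freqs > 0
    weights[present] = 1.0 / freqs[present]
    return weights

def weighted_log_loss(probs, labels, class_weights):
    """Mean over examples of -w[y_i] * log p_i[y_i]."""
    eps = 1e-12  # guards against log(0)
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(-class_weights[labels] * np.log(picked + eps)))

# Toy example: class 0 is three times as frequent as class 1.
labels = np.array([0, 0, 0, 1])
weights = inverse_frequency_weights(labels, num_classes=2)
probs = np.full((4, 2), 0.5)  # an uninformative classifier
loss = weighted_log_loss(probs, labels, weights)
```

With this weighting, a mistake on a rare class costs more than a mistake on a frequent one, so a classifier cannot lower the loss simply by favoring the dominant class.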

Abstract

We introduce a purely feed-forward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by "zooming out" from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead superpixels are classified by a feedforward multilayer network. Our architecture achieves new state of the art performance in semantic segmentation, obtaining 64.4% average accuracy on the PASCAL VOC 2012 test set.

Paper Structure

This paper contains 19 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Our feedforward segmentation process. The feature vector for a superpixel consists of components extracted at zoom-out spatial levels: locally at a superpixel (red), in a small proximal neighborhood (cyan), in a larger distant neighborhood (orange), and globally from the entire image (green). The concatenated feature vector is fed to a multi-layer neural network that classifies the superpixel.
  • Figure 2: Examples of zoom-out regions: red for superpixel, cyan for proximal region, solid orange for distant region; curved orange line shows the extent of the radius-3 neighborhood on which the distant region is based. The global region is always the entire image. The image on the left shows superpixel boundaries in black; these are typical. Distant regions tend to enclose large portions of objects. Proximal regions are more likely to include moderate portions of objects, and both often include surrounding objects/background as well.
  • Figure 3: Three superpixels in each image (top), followed by the corresponding zoom-out regions seen by the segmentation process: (left) superpixel, (center) proximal region, (right) distant region. As we zoom in from the image to the superpixel level, it becomes increasingly hard to tell what we are looking at; however, the higher zoom-out levels provide rich contextual information.
  • Figure 4: Color code for VOC categories. Background is black.
  • Figure 5: Examples illustrating the effect of zoom-out levels. From left: original image, ground truth, local only, local and proximal, local, proximal and global, and the full (four levels) set of zoom-out features. In all cases a linear model is used to label superpixels. See Figure 4 for category color code.
  • ...and 2 more figures
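
The process described in the Figure 1 caption (concatenate the features from the four zoom-out levels, then classify the superpixel with a multi-layer network) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature dimensions, the single hidden layer, the random weights, and the names `zoom_out_feature` and `mlp_forward` are all hypothetical. Only the four-level concatenation and the feedforward classification come from the text.

```python
import numpy as np

def zoom_out_feature(local, proximal, distant, global_):
    """Concatenate the four zoom-out levels into one superpixel descriptor."""
    return np.concatenate([local, proximal, distant, global_])

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a softmax over classes."""
    h = np.maximum(0.0, W1 @ x + b1)
    logits = W2 @ h + b2
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)

# Hypothetical feature sizes per level; the real dimensions are not given here.
dims = {"local": 8, "proximal": 8, "distant": 16, "global": 16}
num_classes = 21  # 20 VOC object categories plus background
hidden = 32       # hypothetical hidden-layer width

# Stand-in features for one superpixel. The global feature would be shared
# by every superpixel in the same image.
levels = {name: rng.normal(size=d) for name, d in dims.items()}
x = zoom_out_feature(levels["local"], levels["proximal"],
                     levels["distant"], levels["global"])

# Randomly initialized (untrained) weights, used only to show the data flow.
W1 = rng.normal(scale=0.1, size=(hidden, x.size))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(num_classes, hidden))
b2 = np.zeros(num_classes)

probs = mlp_forward(x, W1, b1, W2, b2)
predicted_class = int(np.argmax(probs))
```

Each superpixel is labeled independently by one forward pass, which is why no structured inference step is needed at prediction time.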