Pixel-level Encoding and Depth Layering for Instance-level Semantic Labeling

Jonas Uhrig, Marius Cordts, Uwe Franke, Thomas Brox

TL;DR

This work tackles instance-level semantic labeling in urban street scenes from monocular imagery by introducing a fully convolutional network that outputs three per-pixel channels: semantic class, depth class, and direction toward the visible center of each instance. A lightweight post-processing pipeline applies template matching to the direction map and fuses the results with low-level computer vision techniques to produce instance segmentation without region proposals, while also providing per-instance depth estimates. The approach achieves state-of-the-art instance segmentation results on KITTI and Cityscapes and competitive pixel-level semantics, and its monocular depth estimates show low MAE/RMSE and high delta accuracy. Overall, the method offers a single framework for joint semantic labeling, instance segmentation, and depth estimation, enabling improved scene understanding for autonomous driving.

Abstract

Recent approaches for instance-aware semantic labeling have augmented convolutional neural networks (CNNs) with complex multi-task architectures or computationally expensive graphical models. We present a method that leverages a fully convolutional network (FCN) to predict semantic labels, depth and an instance-based encoding using each pixel's direction towards its corresponding instance center. Subsequently, we apply low-level computer vision techniques to generate state-of-the-art instance segmentation on the street scene datasets KITTI and Cityscapes. Our approach outperforms existing works by a large margin and can additionally predict absolute distances of individual instances from a monocular image as well as a pixel-level semantic labeling.
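
To make the instance-center direction encoding concrete, the following is a minimal sketch of how such a per-pixel target could be derived from an instance ID mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `direction_to_center`, the use of the mean of visible pixels as the "visible center", the background label `0`, and the number of angle bins (`num_bins=8`) are all hypothetical choices that the text above does not specify.

```python
import numpy as np


def direction_to_center(instance_ids, num_bins=8):
    """Per-pixel direction toward the visible center of each instance.

    instance_ids: 2D integer array, 0 = background (assumption).
    Returns (angles, bins):
      angles: angle in radians, in [-pi, pi], NaN on background
      bins:   angle quantized into num_bins classes, -1 on background
    Angles use image coordinates (y grows downward).
    """
    h, w = instance_ids.shape
    angles = np.full((h, w), np.nan, dtype=np.float64)
    bins = np.full((h, w), -1, dtype=np.int64)
    ys, xs = np.mgrid[0:h, 0:w]

    for inst in np.unique(instance_ids):
        if inst == 0:  # skip background
            continue
        mask = instance_ids == inst
        # Assumption: "visible center" = centroid of the visible pixels.
        cy = ys[mask].mean()
        cx = xs[mask].mean()
        # Angle of the vector pointing from each pixel to the center.
        theta = np.arctan2(cy - ys[mask], cx - xs[mask])
        angles[mask] = theta
        # Map [-pi, pi] onto num_bins discrete direction classes.
        q = np.floor((theta + np.pi) / (2 * np.pi) * num_bins)
        bins[mask] = q.astype(np.int64) % num_bins

    return angles, bins


if __name__ == "__main__":
    ids = np.zeros((5, 5), dtype=np.int64)
    ids[0:3, 0:3] = 1  # one square instance centered at (row 1, col 1)
    angles, bins = direction_to_center(ids)
    print(bins)
```

Because every pixel of an instance points at the same center, pixels on opposite sides of the center receive opposite direction classes. This is the kind of structure a template matching step on the direction map can exploit to locate instance centers.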

Paper Structure

This paper contains 19 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Example scene representation as obtained by our method: instance segmentation, monocular depth estimation, and pixel-level semantic labeling.
  • Figure 1: Further example results of our instance segmentation (right) and corresponding ground truth (middle) on KITTI.
  • Figure 2: From a single image, we predict three FCN outputs: semantics, depth, and instance center direction. Those are used to compute template matching score maps for semantic categories. Using these, we locate and generate instance proposals and fuse them to obtain our instance segmentation.
  • Figure 2: Further example results of our instance segmentation (right) and corresponding ground truth (center) on Cityscapes validation.
  • Figure 3: Ground truth examples of our three proposed FCN channels: (a) semantic color overlay as suggested by Cordts2015; (b) depth per object, from red (close) to blue (distant); (c) directions towards corresponding instance centers.
  • ...and 2 more figures