Pixel-level Encoding and Depth Layering for Instance-level Semantic Labeling
Jonas Uhrig, Marius Cordts, Uwe Franke, Thomas Brox
TL;DR
This work tackles instance-level semantic labeling of urban street scenes from monocular images. A fully convolutional network predicts three per-pixel outputs: semantic class, depth class, and direction toward the visible center of the instance the pixel belongs to. A lightweight post-processing pipeline then applies template matching to the direction map, followed by a fusion step based on low-level computer vision techniques, to produce proposal-free instance segmentation along with a depth estimate for each instance. The approach achieves state-of-the-art instance segmentation results on KITTI and Cityscapes and competitive pixel-level semantic labeling, and its monocular depth estimates show low MAE and RMSE and high delta-threshold accuracy. Overall, the method jointly addresses semantic labeling, instance segmentation, and depth estimation without region proposals, supporting scene understanding for autonomous driving.
Abstract
Recent approaches for instance-aware semantic labeling have augmented convolutional neural networks (CNNs) with complex multi-task architectures or computationally expensive graphical models. We present a method that leverages a fully convolutional network (FCN) to predict semantic labels, depth and an instance-based encoding using each pixel's direction towards its corresponding instance center. Subsequently, we apply low-level computer vision techniques to generate state-of-the-art instance segmentation on the street scene datasets KITTI and Cityscapes. Our approach outperforms existing works by a large margin and can additionally predict absolute distances of individual instances from a monocular image as well as a pixel-level semantic labeling.
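To make the instance-based encoding concrete, the following is a minimal sketch of how a per-pixel direction target could be derived from a ground-truth instance mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `direction_to_center` is hypothetical, the instance center is assumed to be the mean coordinate of the visible pixels, and the output is a continuous unit vector per pixel rather than whatever discretization the network actually predicts.

```python
import numpy as np


def direction_to_center(instance_ids):
    """Unit vectors pointing from each pixel to its instance's center.

    instance_ids: (H, W) integer array, 0 = background, k > 0 = instance k.
    Returns an (H, W, 2) float array holding (dy, dx) per pixel.
    Background pixels and pixels lying exactly on a center get (0, 0).
    """
    h, w = instance_ids.shape
    directions = np.zeros((h, w, 2), dtype=np.float64)
    for k in np.unique(instance_ids):
        if k == 0:
            continue  # skip background
        ys, xs = np.nonzero(instance_ids == k)
        # Assumption: the "visible center" is the centroid of visible pixels.
        cy, cx = ys.mean(), xs.mean()
        dy, dx = cy - ys, cx - xs
        norm = np.hypot(dy, dx)
        norm[norm == 0] = 1.0  # avoid division by zero at the center itself
        directions[ys, xs, 0] = dy / norm
        directions[ys, xs, 1] = dx / norm
    return directions


# Toy example: one three-pixel instance in the top row, one single-pixel instance.
mask = np.array([[1, 1, 1],
                 [0, 0, 0],
                 [2, 0, 0]])
dirs = direction_to_center(mask)
```

In this toy mask, the two outer pixels of instance 1 point inward along the row toward the middle pixel, which is the kind of pattern a template-matching step on the direction map could use to locate instance centers.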
