Semantic Instance Segmentation with a Discriminative Loss Function

Bert De Brabandere, Davy Neven, Luc Van Gool

TL;DR

This paper tackles semantic instance segmentation by learning pixel-level embeddings with a discriminative loss that pulls pixels of the same instance together and pushes different instances apart. The resulting embeddings can be clustered with a simple post-processing step, so the method needs neither object proposals nor recurrent architectures. It achieves competitive results on the Cityscapes and CVPPP benchmarks, and argues that reasoning over the whole image handles occlusions better than detect-and-segment pipelines. Because the loss works with an off-the-shelf network, the approach can be integrated with standard semantic segmentation architectures, which leaves room for joint training in future work.

Abstract

Semantic instance segmentation remains a challenging task. In this work we propose to tackle the problem with a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin. Our approach of combining an off-the-shelf network with a principled loss function inspired by a metric learning objective is conceptually simple and distinct from recent efforts in instance segmentation. In contrast to previous works, our method does not rely on object proposals or recurrent mechanisms. A key contribution of our work is to demonstrate that such a simple setup without bells and whistles is effective and can perform on par with more complex methods. Moreover, we show that it does not suffer from some of the limitations of the popular detect-and-segment approaches. We achieve competitive performance on the Cityscapes and CVPPP leaf segmentation benchmarks.
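
The pull and push behaviour described in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes squared hinge terms, Euclidean distances, and a push threshold of twice the separation margin, and it omits any regularization term. The function name `discriminative_loss`, the weights `alpha` and `beta`, and the default margins `delta_v=0.5` and `delta_d=1.5` are illustrative assumptions.

```python
import math
from collections import defaultdict


def discriminative_loss(embeddings, labels, delta_v=0.5, delta_d=1.5,
                        alpha=1.0, beta=1.0):
    """Sketch of a pull/push loss for one image (assumed form, see lead-in).

    embeddings: non-empty list of equal-length float sequences, one per pixel.
    labels: list of instance ids, one per pixel.
    """
    # Group pixel embeddings by instance id.
    clusters = defaultdict(list)
    for x, c in zip(embeddings, labels):
        clusters[c].append(x)

    # Cluster center = mean embedding of the instance.
    centers = {
        c: [sum(col) / len(xs) for col in zip(*xs)]
        for c, xs in clusters.items()
    }

    # Pull term: penalize pixels farther than delta_v from their own center.
    pull = 0.0
    for c, xs in clusters.items():
        hinge = [max(0.0, math.dist(x, centers[c]) - delta_v) ** 2 for x in xs]
        pull += sum(hinge) / len(xs)
    pull /= len(clusters)

    # Push term: penalize pairs of centers closer than 2 * delta_d (assumed).
    ids = list(centers)
    push = 0.0
    pairs = 0
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            gap = 2 * delta_d - math.dist(centers[a], centers[b])
            push += max(0.0, gap) ** 2
            pairs += 1
    if pairs:
        push /= pairs

    return alpha * pull + beta * push


# Two compact, well-separated instances incur no loss.
pixels = [(0.0, 0.0), (0.2, 0.0), (4.0, 0.0), (4.2, 0.0)]
loss = discriminative_loss(pixels, [0, 0, 1, 1])
```

Both terms are hinged, so the loss is exactly zero once every pixel lies within `delta_v` of its center and all centers are far enough apart. In a real training setup the same computation would be written with a tensor library so that gradients flow back to the network.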

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The network maps each pixel to a point in feature space so that pixels belonging to the same instance are close to each other, and can easily be clustered with a fast post-processing step. From top to bottom, left to right: input image, output of the network, pixel embeddings in 2-dimensional feature space, clustered image.
  • Figure 2: The intra-cluster pulling force pulls embeddings towards the cluster center, i.e. the mean embedding of that cluster. The inter-cluster repelling force pushes cluster centers away from each other. Both forces are hinged: they are only active up to a certain distance determined by the margins $\delta_v$ and $\delta_d$, denoted by the dotted circles. This diagram is inspired by a similar one in Weinberger and Saul (2009).
  • Figure 3: Convergence of our method on a single image in a 2-dimensional feature space. Left: input and ground truth label. The middle row shows the raw output of the network (as the R- and G- channels of an RGB image), masked with the foreground mask. The upper row shows each of the pixel embeddings $x_i$ in 2-d feature space, colored corresponding to their ground truth label. The cluster center $\mu_c$ and margins $\delta_v$ and $\delta_d$ are also drawn. The last row shows the result of clustering the embeddings by thresholding around their cluster center, as explained in the post-processing section. We display the images after 0, 2, 4, 8, 16, 32 and 64 gradient update steps.
  • Figure 4: Results on the synthetic scattered sticks dataset to illustrate that our approach is a good fit for problems with complex occlusions.
  • Figure 5: Some visual examples on the CVPPP leaf dataset.
  • ...and 1 more figure
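
The thresholding step mentioned in the Figure 3 caption can be sketched as follows. This is a hedged illustration only: it assumes the cluster centers are already known (as they are when ground truth is available), uses Euclidean distance, and takes the threshold as a free parameter. The function name `cluster_by_threshold` and the parameter `bandwidth` are hypothetical, and the caption does not say which margin serves as the threshold.

```python
import math


def cluster_by_threshold(embeddings, centers, bandwidth):
    """Assign each embedding to the first center within `bandwidth`.

    Returns one index into `centers` per embedding, or -1 where no
    center is close enough.
    """
    assignment = []
    for x in embeddings:
        label = -1
        for k, mu in enumerate(centers):
            if math.dist(x, mu) < bandwidth:
                label = k
                break
        assignment.append(label)
    return assignment


# Two known centers; the third pixel is far from both and stays unassigned.
centers = [(0.0, 0.0), (4.0, 0.0)]
pixels = [(0.1, 0.1), (3.9, -0.2), (2.0, 2.0)]
labels = cluster_by_threshold(pixels, centers, bandwidth=0.5)
# labels == [0, 1, -1]
```

When centers are not known in advance, they would have to be estimated first, for example from the embeddings themselves; that step is outside this sketch.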