Proposal-free Network for Instance-level Object Segmentation

Xiaodan Liang, Yunchao Wei, Xiaohui Shen, Jianchao Yang, Liang Lin, Shuicheng Yan

TL;DR

PFN addresses the challenging task of instance-level segmentation without region proposals by jointly predicting category-level masks, per-pixel instance location vectors, and per-category instance counts. It introduces multi-scale instance location prediction with coordinate maps and uses spectral clustering to convert pixel-level outputs into instance masks, aided by a category-level protection to ensure consistency. The approach is trained in two stages (category-level followed by instance-level) and demonstrates substantial performance gains on PASCAL VOC 2012, achieving $AP^r$ of $58.7\%$ at 0.5 IoU, far surpassing prior proposal-based methods. This work offers a faster, simpler, and more scalable alternative for accurate instance segmentation with significant practical impact for downstream vision tasks.
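The "at 0.5 IoU" qualifier refers to the overlap between a predicted instance mask and a ground-truth mask. A minimal sketch of that overlap measure follows, assuming masks are given as nested lists of 0/1 values. The function name `mask_iou` and the toy masks are illustrative and not taken from the paper.

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of identical shape.

    pred, gt: nested lists (rows of 0/1 values). Hypothetical helper,
    not code from the paper.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Two empty masks have no defined overlap; treat it as 0.0 here.
    return inter / union if union else 0.0


pred = [[1, 1, 0],
        [1, 1, 0]]
gt = [[0, 1, 1],
      [0, 1, 1]]

iou = mask_iou(pred, gt)
# Assumption: a prediction counts as correct when its IoU exceeds the threshold.
is_match = iou > 0.5
```

Here the two masks share 2 pixels out of 6 in their union, so the prediction would not count as a match at the 0.5 threshold.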

Abstract

Instance-level object segmentation is an important yet under-explored task. The few existing studies are almost all based on region proposal methods that extract candidate segments and then apply object classification to produce the final results. However, generating accurate region proposals is itself quite challenging. In this work, we propose a Proposal-Free Network (PFN) to address the instance-level object segmentation problem. Based on a pixel-to-pixel deep convolutional neural network, PFN outputs the instance numbers of different categories and the pixel-level information on 1) the coordinates of the instance bounding box each pixel belongs to, and 2) the confidences of different categories for each pixel. Taken together, these outputs can be turned into the final instance-level object segmentation results by simple post-processing with any off-the-shelf clustering method. The whole PFN can be easily trained in an end-to-end way without requiring a proposal generation stage. Extensive evaluations on the challenging PASCAL VOC 2012 semantic segmentation benchmark demonstrate that the proposed PFN clearly outperforms the state of the art for instance-level object segmentation. In particular, the $AP^r$ over 20 classes at 0.5 IoU reaches 58.7% with PFN, significantly higher than the 43.8% and 46.3% of the state-of-the-art algorithms, SDS [9] and [16], respectively.
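The post-processing idea, grouping pixels whose predicted instance locations agree, can be illustrated with a small sketch. This is not the paper's procedure: the TL;DR states that PFN uses spectral clustering, while the code below swaps in a much simpler greedy grouping on predicted instance centers only. The function name `group_by_predicted_center`, the `radius` parameter, and the toy predictions are all assumptions made for illustration.

```python
from math import hypot


def group_by_predicted_center(centers, radius):
    """Greedily group pixels whose predicted instance centers lie close together.

    centers: list of (cx, cy) predictions, one per foreground pixel.
    radius:  maximum distance to a cluster's running mean center (assumed knob).
    Returns one integer instance id per pixel.
    """
    clusters = []  # each entry: [sum_x, sum_y, count] of assigned predictions
    labels = []
    for cx, cy in centers:
        assigned = None
        for idx, (sum_x, sum_y, count) in enumerate(clusters):
            if hypot(cx - sum_x / count, cy - sum_y / count) <= radius:
                assigned = idx
                break
        if assigned is None:
            # No existing cluster is close enough: start a new instance.
            clusters.append([cx, cy, 1])
            assigned = len(clusters) - 1
        else:
            # Update the running sums so the cluster mean tracks its members.
            clusters[assigned][0] += cx
            clusters[assigned][1] += cy
            clusters[assigned][2] += 1
        labels.append(assigned)
    return labels


# Toy predictions: three pixels point near (10, 10), two near (30, 12).
predicted_centers = [(10.2, 9.8), (9.9, 10.1), (30.0, 12.0),
                     (10.0, 10.0), (29.5, 11.6)]
labels = group_by_predicted_center(predicted_centers, radius=3.0)
print(labels)  # [0, 0, 1, 0, 1]
```

A greedy pass like this depends on pixel order and on the chosen radius, which is one reason a method that uses the predicted instance numbers to fix the cluster count may be preferable in practice.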

Paper Structure

This paper contains 12 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Exemplar instance-level object segmentation results. For each image, the category-level segmentation results, the predicted instance locations for all foreground pixels, and the instance-level segmentation results are shown in sequence in each row. Different colors indicate the different object instances for each category. To better show the predicted instance locations, we plot velocity vectors that start from each pixel and point to its corresponding predicted instance center, as shown by the arrows. Note that the pixels predicting similar object centers can be directly collected as one instance region. Best viewed in color and scaled up three times.
  • Figure 2: The proposal-free network overview. Our network predicts the instance numbers of all categories and the pixel-level information that includes the category-level confidences for each pixel and the coordinates of the instance bounding box each pixel belongs to. The instance location prediction for each pixel involves the coordinates of the center, top-left corner and bottom-right corner of the object instance that the pixel belongs to. Any off-the-shelf clustering method can be utilized to generate the ultimate instance-level segmentation results.
  • Figure 3: The detailed network architecture and parameter setting of PFN. First, the category-level segmentation network is fine-tuned based on the pre-trained VGG-16 classification network. The cross-entropy loss is used for optimization. Second, the instance-level segmentation network, which simultaneously predicts the instance numbers of all categories and the instance location vector for each pixel, is further fine-tuned. The multi-scale prediction streams (with different resolutions and receptive fields) are appended to the intermediate convolutional layers, and are then fused to generate the final instance location predictions. In each stream, we incorporate the corresponding coordinates (i.e. x and y dimension) of each pixel as feature maps in the second convolutional layer, giving 130 = 128 + 2 channels. The regression loss is used during training. To predict instance numbers, the convolutional feature maps and the instance location maps are concatenated together for inference, and the Euclidean loss is used. The two losses from the two targets are jointly optimized for the whole network training.
  • Figure 4: Exemplar segmentation results obtained by refining the category-level segmentation with the predicted instance numbers. For each image, we show on the left the classification results inferred from the category-level segmentation and the predicted instance numbers. In the first row, the refining strategy is to convert the inconsistent predicted labels into background. In the second row, the refining strategy is to convert the wrongly predicted labels in the category-level segmentation to the ones predicted in the instance number vector. Different colors indicate different object instances. Better viewed in a zoomed-in color PDF file.
  • Figure 5: Comparison of segmentation results by constraining the pixel number of each clustered object instance.
  • ...and 2 more figures
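
The Figure 3 caption mentions appending each pixel's x and y coordinates as two extra feature maps, turning 128 channels into 130. A rough sketch of that concatenation follows, using nested lists in a channels-first layout. The helper names and the normalization of coordinates to [0, 1] are assumptions; the caption does not say how the coordinate values are scaled.

```python
def coordinate_maps(height, width):
    """Build per-pixel x and y coordinate maps of size height x width.

    Assumption: coordinates are normalized to [0, 1]; the source does not
    specify the scaling.
    """
    x_den = max(width - 1, 1)
    y_den = max(height - 1, 1)
    x_map = [[x / x_den for x in range(width)] for _ in range(height)]
    y_map = [[y / y_den for _ in range(width)] for y in range(height)]
    return x_map, y_map


def append_coordinates(features):
    """Append x and y coordinate maps as two extra channels.

    features: list of channels, each a height x width nested list.
    Returns a new list with two more channels than the input.
    """
    height = len(features[0])
    width = len(features[0][0])
    x_map, y_map = coordinate_maps(height, width)
    return features + [x_map, y_map]


# A dummy 128-channel feature map of spatial size 3 x 4.
features = [[[0.0] * 4 for _ in range(3)] for _ in range(128)]
augmented = append_coordinates(features)
print(len(augmented))  # 130
```

Giving the convolution direct access to pixel positions is one plausible way to let a translation-invariant network regress absolute instance coordinates.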