Bridging Category-level and Instance-level Semantic Image Segmentation
Zifeng Wu, Chunhua Shen, Anton van den Hengel
TL;DR
The paper presents a pipeline that builds instance-level segmentation on top of strong category-level semantic segmentation by learning per-pixel bounding-box transforms and using a Hough-like maxima search, rather than relying on bounding-box proposals. It introduces online bootstrapping to focus learning on hard pixels and develops a Fully Convolutional Residual Network that achieves state-of-the-art semantic segmentation and competitive instance segmentation on standard benchmarks. Key findings show that deeper, higher-resolution features with a large field of view, combined with effective hard-pixel sampling, markedly improve IoU and AP. The approach offers a practical, proposal-free alternative to traditional detect-then-segment methods, with robust performance across the PASCAL VOC 2012, Cityscapes, and PASCAL-Context datasets.
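To make the per-pixel box prediction and Hough-like maxima search concrete, the following is a minimal sketch, not the authors' implementation. It assumes each pixel in a category mask already carries a predicted instance box `(x1, y1, x2, y2)`, and it replaces the maxima search with a simpler stand-in: boxes are quantised into bins, and every bin with enough votes is treated as one instance. The function name `vote_for_boxes` and the parameters `bin_size` and `min_votes` are hypothetical.

```python
from collections import Counter


def vote_for_boxes(pixel_boxes, bin_size=8, min_votes=3):
    """Group per-pixel box predictions into instances by voting.

    pixel_boxes: dict mapping a pixel (x, y) to its predicted box
                 (x1, y1, x2, y2). Assumed to come from a regression network.
    Returns (instances, assignment):
      instances  - list of mean boxes, one per bin that got enough votes
      assignment - dict mapping each pixel in a kept bin to an instance index
    """
    def quantise(box):
        # Coarse bin for a box; nearby predictions fall into the same bin.
        return tuple(int(round(v / bin_size)) for v in box)

    # Each pixel casts one vote for the bin of its predicted box.
    votes = Counter(quantise(box) for box in pixel_boxes.values())

    # Simplification: keep every bin above a vote threshold instead of
    # searching for local maxima in the vote space.
    kept = [key for key, n in votes.most_common() if n >= min_votes]
    index = {key: i for i, key in enumerate(kept)}

    sums = [[0.0] * 4 for _ in kept]
    counts = [0] * len(kept)
    assignment = {}
    for pixel, box in pixel_boxes.items():
        key = quantise(box)
        if key in index:
            i = index[key]
            assignment[pixel] = i
            counts[i] += 1
            for j, v in enumerate(box):
                sums[i][j] += v

    # Each instance box is the mean of the boxes that voted for it.
    instances = [tuple(s / c for s in row) for row, c in zip(sums, counts)]
    return instances, assignment
```

Pixels whose predictions land in a sparsely populated bin are left unassigned here. A fuller pipeline would need a rule for them, for example attaching each to the nearest kept instance.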
Abstract
We propose an approach to instance-level image segmentation that is built on top of category-level segmentation. Specifically, for each pixel in a semantic category mask, the bounding box of its corresponding instance is predicted using a deep fully convolutional regression network. The method thus follows a different pipeline from the popular detect-then-segment approaches, which first predict instances' bounding boxes and are the current state of the art in instance segmentation. We show that, by leveraging the strength of our state-of-the-art semantic segmentation models, the proposed method can achieve results comparable to, or even better than, those of detect-then-segment approaches. We make the following contributions. (i) We propose a simple yet effective approach to semantic instance segmentation. (ii) We propose an online bootstrapping method for training, which is critically important for achieving good performance in both semantic category segmentation and instance-level segmentation. (iii) Because the performance of semantic category segmentation has a significant impact on instance-level segmentation, which is the second step of our approach, we train fully convolutional residual networks to achieve the best semantic category segmentation accuracy. On the PASCAL VOC 2012 dataset, we obtain the currently best mean intersection-over-union score of 79.1%. (iv) We also achieve state-of-the-art results for instance-level segmentation.
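As an illustration of what online bootstrapping over hard pixels can look like, here is a minimal sketch. The abstract does not state the exact selection rule, so the top-k criterion below is an assumption: the per-pixel cross-entropy is computed for every pixel, and only the `k` pixels with the highest loss contribute to the training loss. The function name `bootstrapped_loss` and the parameter `k` are hypothetical.

```python
import math


def bootstrapped_loss(probs, labels, k):
    """Mean cross-entropy over the k hardest pixels.

    probs:  list of per-pixel class-probability lists (each sums to 1)
    labels: list of ground-truth class indices, one per pixel
    k:      number of highest-loss pixels to keep (assumed selection rule)
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not 1 <= k <= len(labels):
        raise ValueError("k must be between 1 and the number of pixels")

    # Per-pixel cross-entropy; clamp to avoid log(0).
    losses = [-math.log(max(p[y], 1e-12)) for p, y in zip(probs, labels)]

    # Keep only the k pixels the model currently gets most wrong.
    hardest = sorted(losses, reverse=True)[:k]
    return sum(hardest) / k
```

With `k` equal to the number of pixels, this reduces to the ordinary mean cross-entropy. Smaller values of `k` concentrate the loss on pixels that are still poorly classified.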
