Panoptic Segmentation

Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, Piotr Dollár

TL;DR

Panoptic segmentation (PS) is a unified vision task that combines semantic and instance segmentation into a single coherent output, evaluated with the novel panoptic quality (PQ) metric. PQ factors into segmentation quality (SQ) and recognition quality (RQ); a predicted segment and a ground truth segment are matched when their IoU exceeds 0.5, which lets stuff and thing classes be scored in the same way. The paper lays the groundwork with human consistency studies and machine baselines on Cityscapes, ADE20k, and Mapillary Vistas, showing that PS is practical but challenging and highlighting a notable gap between human and machine recognition, especially for small objects. The work aims to reinvigorate unified scene understanding and to encourage end-to-end PS models that jointly reason about all scene elements and produce non-overlapping segments.
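To make the metric concrete, here is a minimal sketch of PQ for a single class, following the definition PQ = (sum of IoU over matched pairs) / (|TP| + ½|FP| + ½|FN|) = SQ × RQ. The function name `panoptic_quality` and the representation of segments as sets of pixel indices are assumptions made for illustration. The sketch ignores void regions and the averaging of PQ over classes, and returning zeros for a class with no segments is a choice made here, not something the summary specifies.

```python
def panoptic_quality(gt_segments, pred_segments):
    """Return (PQ, SQ, RQ) for one class.

    gt_segments, pred_segments: lists of sets of pixel indices.
    Segments within each list are assumed not to overlap.
    """
    matched_pred = set()
    iou_sum = 0.0
    tp = 0
    for g in gt_segments:
        for pi, p in enumerate(pred_segments):
            union = len(g | p)
            iou = len(g & p) / union if union else 0.0
            if iou > 0.5:
                # With non-overlapping segments, at most one prediction
                # can exceed IoU 0.5 for this ground truth segment.
                matched_pred.add(pi)
                iou_sum += iou
                tp += 1
                break

    fp = len(pred_segments) - len(matched_pred)  # unmatched predictions
    fn = len(gt_segments) - tp                   # unmatched ground truth
    denom = tp + 0.5 * fp + 0.5 * fn
    if denom == 0:
        return 0.0, 0.0, 0.0  # illustrative choice for an empty class

    sq = iou_sum / tp if tp else 0.0  # mean IoU of matched pairs
    rq = tp / denom                   # F1-style recognition score
    return sq * rq, sq, rq


# Toy example: one good match, one missed segment, one spurious prediction.
gt = [{0, 1, 2, 3}, {4, 5, 6, 7}]
pred = [{0, 1, 2}, {8, 9}]
pq, sq, rq = panoptic_quality(gt, pred)
print(pq, sq, rq)
```

In the toy example the single matched pair has IoU 3/4, so SQ is 0.75, while one false positive and one false negative bring RQ down to 0.5.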

Abstract

We propose and study a task we name panoptic segmentation (PS). Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance). The proposed task requires generating a coherent scene segmentation that is rich and complete, an important step toward real-world vision systems. While early work in computer vision addressed related image/scene parsing tasks, these are not currently popular, possibly due to lack of appropriate metrics or associated recognition challenges. To address this, we propose a novel panoptic quality (PQ) metric that captures performance for all classes (stuff and things) in an interpretable and unified manner. Using the proposed metric, we perform a rigorous study of both human and machine performance for PS on three existing datasets, revealing interesting insights about the task. The aim of our work is to revive the interest of the community in a more unified view of image segmentation.

Paper Structure

This paper contains 35 sections, 1 theorem, 5 equations, 9 figures, 6 tables.

Key Result

Theorem 1

Given a predicted and ground truth panoptic segmentation of an image, each ground truth segment can have at most one corresponding predicted segment with IoU strictly greater than 0.5 and vice versa.
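One way to see why the matching is unique is the following sketch. It relies only on the fact that segments within a panoptic segmentation do not overlap, and it is not necessarily the paper's own wording. The symbols $g$, $p_1$, $p_2$ are introduced here for illustration.

```latex
Let $g$ be a ground truth segment and suppose two distinct predicted
segments $p_1, p_2$ both satisfy $\mathrm{IoU}(p_i, g) > 0.5$.
Since $|p_i \cup g| \ge |g|$,
\[
  |p_i \cap g| \;>\; \tfrac{1}{2}\,|p_i \cup g| \;\ge\; \tfrac{1}{2}\,|g|,
  \qquad i = 1, 2 .
\]
Adding the two inequalities gives
\[
  |p_1 \cap g| + |p_2 \cap g| \;>\; |g| .
\]
But $p_1$ and $p_2$ are disjoint, so $p_1 \cap g$ and $p_2 \cap g$ are
disjoint subsets of $g$, and therefore
\[
  |p_1 \cap g| + |p_2 \cap g| \;\le\; |g| ,
\]
a contradiction. Swapping the roles of predicted and ground truth
segments gives the converse direction.
```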

Figures (9)

  • Figure 1: For a given (a) image, we show ground truth for: (b) semantic segmentation (per-pixel class labels), (c) instance segmentation (per-object mask and class label), and (d) the proposed panoptic segmentation task (per-pixel class+instance labels). The PS task: (1) encompasses both stuff and thing classes, (2) uses a simple but general format, and (3) introduces a uniform evaluation metric for all classes. Panoptic segmentation generalizes both semantic and instance segmentation and we expect the unified task will present novel challenges and enable innovative new methods.
  • Figure 2: Toy illustration of ground truth and predicted panoptic segmentations of an image. Pairs of segments of the same color have IoU larger than 0.5 and are therefore matched. We show how the segments for the person class are partitioned into true positives $\mathit{TP}$, false negatives $\mathit{FN}$, and false positives $\mathit{FP}$.
  • Figure 3: Segmentation flaws. Images are zoomed and cropped. Top row (Vistas image): both annotators identify the object as a car, however, one splits the car into two cars. Bottom row (Cityscapes image): the segmentation is genuinely ambiguous.
  • Figure 4: Classification flaws. Images are zoomed and cropped. Top row (ADE20k image): simple misclassification. Bottom row (Cityscapes image): the scene is extremely difficult, tram is the correct class for the segment. Many errors are difficult to resolve.
  • Figure 5: Per-class human consistency, sorted by PQ. Thing classes are shown in red, stuff classes in orange (for ADE20k every other class is shown; classes without matches in the dual-annotated test sets are omitted). Things and stuff are distributed fairly evenly, implying PQ balances their performance.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof