Extreme Point Supervised Instance Segmentation

Hyeonjun Lee, Sehyun Hwang, Suha Kwak

TL;DR

EXITS addresses the high cost of pixel-level masks in instance segmentation by exploiting extreme points that accompany bounding box annotations. It introduces a two-stage approach: a ViT-based pseudo label generator learns to produce high-quality object masks from single-object crops by propagating seeds over a fully connected point graph using a transition matrix $\mathbf{T}$ derived from a similarity matrix $\mathbf{S}$ obtained from a pretrained similarity extractor and refined with Sinkhorn normalization. In the second stage, these pseudo masks train a fully supervised instance segmentation model, enabling strong performance with box- or extreme-point supervision while narrowing the gap to fully supervised methods. The method shows state-of-the-art results on COCO, Pascal VOC, and LVIS, with particular strength on separated/occluded objects, and analyses reveal the importance of pseudo-label quality, propagation depth $\alpha$, and similarity extractor warm-up.
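The propagation step summarized above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that Sinkhorn normalization means alternating row and column normalization of $\mathbf{S}$, and that propagation depth $\alpha$ means applying $\mathbf{T}$ to the seed vector $\alpha$ times. The toy features, the temperature `0.5`, and the function names are hypothetical.

```python
import numpy as np

def sinkhorn_normalize(S, n_iters=50, eps=1e-8):
    """Alternately normalize rows and columns of a positive matrix S
    so that the result is approximately doubly stochastic."""
    T = np.asarray(S, dtype=np.float64).copy()
    for _ in range(n_iters):
        T /= T.sum(axis=1, keepdims=True) + eps  # rows sum to ~1
        T /= T.sum(axis=0, keepdims=True) + eps  # columns sum to ~1
    return T

def propagate(T, seeds, alpha):
    """Spread seed scores over the point graph by applying T alpha times
    (assumed reading of 'propagation depth')."""
    scores = np.asarray(seeds, dtype=np.float64)
    for _ in range(alpha):
        scores = T @ scores
    return scores

# Toy graph of 4 points: points 0 and 1 are similar, points 2 and 3 are similar.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
S = np.exp(feats @ feats.T / 0.5)   # positive similarities; temperature is assumed
T = sinkhorn_normalize(S)

fg_seed = np.array([1.0, 0.0, 0.0, 0.0])  # point 0 acts as a foreground seed
fg_scores = propagate(T, fg_seed, alpha=3)
print(fg_scores)
```

In this toy setting the seed's score flows mostly to point 1, which shares its features, and much less to points 2 and 3.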

Abstract

This paper introduces a novel approach to learning instance segmentation using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost points, of each object. These points are readily available in the modern bounding box annotation process while offering strong clues for precise segmentation, and thus allow performance to be improved at the same annotation cost as box-supervised methods. Our work considers extreme points as a part of the true instance mask and propagates them to identify potential foreground and background points, which are all together used for training a pseudo label generator. Then pseudo labels given by the generator are in turn used for supervised learning of our final model. On three public benchmarks, our method significantly outperforms existing box-supervised methods, further narrowing the gap with its fully supervised counterpart. In particular, our model generates high-quality masks when a target object is separated into multiple parts, where previous box-supervised methods often fail.
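To make the relation between extreme points and boxes concrete, here is a minimal sketch showing that the four extreme points already determine the tight bounding box, which is why they come at no extra annotation cost. The coordinates and the function name are hypothetical, and image coordinates are assumed to have y growing downward.

```python
def box_from_extreme_points(points):
    """Return the tight box (x_min, y_min, x_max, y_max) spanned by
    an iterable of (x, y) extreme points."""
    points = list(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

# Hypothetical annotation: one (x, y) click per extreme point of an object.
extreme_points = {
    "top": (40, 10),
    "left": (5, 30),
    "bottom": (35, 80),
    "right": (60, 45),
}
box = box_from_extreme_points(extreme_points.values())
print(box)  # (5, 10, 60, 80)
```

The reverse does not hold: a box alone does not say where on its edges the object actually touches, which is the additional cue the extreme points provide.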

Paper Structure (23 sections, 12 equations, 9 figures, 11 tables)

This paper contains 23 sections, 12 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Types of weak supervision and how they are utilized for instance segmentation. Top: A box-supervised method relies on the bounding box tightness prior, which is often violated by occlusion (foreground bag contains tree trunk). As a result, its prediction shows an error in the occluded region. Bottom: The extreme point supervised method (Ours) utilizes extreme points as the initial set of foreground points and propagates labels through semantic similarity between points. The prediction result demonstrates that our method can predict the object mask even under severe occlusion. Best viewed in color.
  • Figure 2: Overview of entire stages of EXITS. In the first stage, an image cropped around each object is used as an input to train the pseudo label generator using point-wise supervision, so that the generator learns to predict a binary mask of the object within the cropped image. In the second stage, the instance segmentation model learns to detect and segment multiple objects, using the generated pseudo mask labels from the first stage.
  • Figure 3: Overview of the first stage of the EXITS framework. The pseudo label generator is trained on images cropped around each object using the extreme points, aiming to predict binary masks. Training leverages two loss functions: $\mathcal{L}_\text{crf}$ aligns images before and after CRF processing, and $\mathcal{L}_\text{point}$ uses pseudo point labels derived from extreme points for precise pixel-wise supervision. To generate these pseudo point labels, EXITS obtains initial foreground and background points from the extreme points, then employs the similarity matrix from the warm-up-trained similarity extractor for label propagation. After propagation, pseudo point labels are produced based on the difference between the propagation scores from the initial foreground and background points. Point dropout is applied as an augmentation, generating the final pseudo point labels.
  • Figure 4: Qualitative comparison of pseudo mask labels on the Separated COCO dataset. (a) Ours, (b) MAL (Lan et al., CVPR 2023), (c) Ground Truth.
  • Figure 5: Qualitative results of the final prediction of EXITS on COCO test-dev, using Mask2Former with a Swin-Small backbone. Trained with our generated pseudo mask labels, EXITS produces high-quality segmentation results, even for separated objects or complex scenes.
  • ...and 4 more figures
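
The pseudo point labeling described in the Figure 3 caption (labels taken from the difference between foreground and background propagation scores, followed by point dropout) can be sketched as follows. This is only an illustration under stated assumptions: the `margin` threshold, the `drop_prob` value, and the use of `-1` as an "ignore" label are hypothetical choices, not details confirmed by the paper.

```python
import numpy as np

def pseudo_point_labels(fg_scores, bg_scores, margin=0.0, drop_prob=0.5, rng=None):
    """Return one label per point: 1 = foreground, 0 = background, -1 = ignored."""
    rng = np.random.default_rng() if rng is None else rng
    diff = np.asarray(fg_scores, dtype=np.float64) - np.asarray(bg_scores, dtype=np.float64)

    labels = np.full(diff.shape, -1, dtype=np.int64)
    labels[diff > margin] = 1    # clearly closer to the foreground seeds
    labels[diff < -margin] = 0   # clearly closer to the background seeds

    # Point dropout (augmentation): randomly ignore a fraction of the points.
    dropped = rng.random(diff.shape) < drop_prob
    labels[dropped] = -1
    return labels

# Hypothetical propagation scores for three points.
fg = [0.9, 0.1, 0.5]
bg = [0.1, 0.8, 0.5]
labels_no_drop = pseudo_point_labels(fg, bg, margin=0.1, drop_prob=0.0)
print(labels_no_drop)  # [ 1  0 -1]
```

Points whose two scores are too close to call stay unlabeled, so a point-wise loss such as $\mathcal{L}_\text{point}$ would only be computed where the propagation gives a clear answer.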