SOS: Segment Object System for Open-World Instance Segmentation With Object Priors

Christian Wilms, Tim Rolff, Maris Hillemann, Robert Johanson, Simone Frintrop

TL;DR

This work tackles Open-World Instance Segmentation by building SOS, a three-part system that prompts SAM with an object-focused prior to generate high-quality pseudo annotations. A key contribution is identifying DINO self-attention maps as the strongest object prior for focusing SAM on objects, which markedly improves precision when training a standard Mask R-CNN on mixed original and pseudo annotations. Extensive cross-dataset and cross-category evaluations on COCO, LVIS, and ADE20k demonstrate strong generalization to unseen object classes, with precision improvements up to 81.6% over prior methods. Overall, SOS offers a practical, plug-and-play approach to OWIS that leverages foundation models to boost localization quality and segmentation performance without requiring extra supervision.

Abstract

We propose an approach for Open-World Instance Segmentation (OWIS), a task that aims to segment arbitrary unknown objects in images by generalizing from a limited set of annotated object classes during training. Our Segment Object System (SOS) explicitly addresses the generalization ability and the low precision of state-of-the-art systems, which often generate background detections. To this end, we generate high-quality pseudo annotations based on the foundation model SAM. We thoroughly study various object priors to generate prompts for SAM, explicitly focusing the foundation model on objects. The strongest object priors were obtained from self-attention maps of self-supervised Vision Transformers, which we utilize for prompting SAM. Finally, the post-processed segments from SAM are used as pseudo annotations to train a standard instance segmentation system. Our approach shows strong generalization capabilities on the COCO, LVIS, and ADE20k datasets and improves precision by up to 81.6% compared to the state of the art. Source code is available at: https://github.com/chwilms/SOS
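
The step from an object prior to prompts for SAM can be illustrated with a minimal sketch. It assumes the prior is a 2D score map (for example, a self-attention map resized to a coarse grid) and greedily picks well-separated peaks as point prompts. The function name `sample_point_prompts` and the parameters `num_points`, `min_distance`, and `min_score` are hypothetical and chosen for illustration only; the selection procedure actually used by SOS may differ and is documented in the linked repository.

```python
import numpy as np


def sample_point_prompts(prior, num_points=5, min_distance=2, min_score=0.5):
    """Greedily pick high-scoring, well-separated locations of a 2D prior map.

    Illustrative assumption, not the SOS implementation.
    Returns a list of (row, col) tuples. After a point is selected, every
    location within `min_distance` rows and columns of it is suppressed.
    """
    scores = prior.astype(float).copy()
    points = []
    for _ in range(num_points):
        # Location of the current global maximum.
        r, c = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[r, c] < min_score:
            break  # remaining locations are too weak to be object evidence
        points.append((int(r), int(c)))
        # Suppress the neighbourhood so the next point lands elsewhere.
        r0, r1 = max(r - min_distance, 0), r + min_distance + 1
        c0, c1 = max(c - min_distance, 0), c + min_distance + 1
        scores[r0:r1, c0:c1] = -np.inf
    return points


if __name__ == "__main__":
    prior = np.zeros((8, 8))
    prior[1, 1] = 0.9   # strong peak
    prior[1, 2] = 0.8   # neighbour of the strong peak, gets suppressed
    prior[6, 5] = 0.7   # second, distant peak
    print(sample_point_prompts(prior))
```

Each returned coordinate would then be passed to SAM as a foreground point prompt; suppressing neighbours keeps several prompts from collapsing onto the same object.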

Paper Structure (33 sections, 5 figures, 5 tables)

This paper contains 33 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of (pseudo) annotations (left) used by Mask R-CNN [he2017mask], GGN [wang2022open], and our SOS, when only original annotations of VOC object classes are given. While Mask R-CNN only uses original annotations, which lack object classes such as tennis racket or surfboard (red arrows), SOS generates pseudo annotations covering those classes (green arrows). GGN generates noisy pseudo annotations that include background areas. As a result, only SOS consistently detects these objects that were not annotated in training (green vs. red arrows on the right). Filled masks denote annotations (left) or detected objects (right), while red frames indicate missed objects.
  • Figure 2: Overview of our Segment Object System (SOS) for OWIS consisting of three blocks. First, the input image is processed in our Object Localization Module (OLM, yellow area) to create object-focused point prompts roughly localizing objects. Second, our Pseudo Annotations Creator (PAC, green area) generates segments based on the previously generated prompts using SAM [kirillov2023segment], and further processes them, leading to a final set of merged original and pseudo annotations. Finally, the merged annotations are used to train an instance segmentation system (blue area).
  • Figure 3: Object priors VOCUS2, U-Net, and DINO with resulting pseudo annotations.
  • Figure 4: Qualitative results of OWIS methods and baseline Mask R-CNN in the cross-category COCO (VOC) $\rightarrow$ COCO (non-VOC) setting. Filled masks denote detected objects, while red frames indicate missed objects.
  • Figure 5: Pseudo annotations generated by GGN and our SOS.
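
The merging of original and pseudo annotations described in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the post-processing used in SOS: masks are boolean NumPy arrays, and a pseudo annotation is kept only if its IoU with every already kept mask stays at or below a threshold. The helpers `mask_iou` and `merge_annotations` and the parameter `max_iou` are hypothetical names introduced here.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def merge_annotations(original, pseudo, max_iou=0.5):
    """Append pseudo masks that do not duplicate an already kept mask.

    Illustrative assumption: a pseudo mask counts as a duplicate when its
    IoU with any kept mask exceeds `max_iou`. Original masks are always kept.
    """
    merged = list(original)
    for cand in pseudo:
        if all(mask_iou(cand, kept) <= max_iou for kept in merged):
            merged.append(cand)
    return merged


def box_mask(shape, r0, r1, c0, c1):
    """Helper for the demo: boolean mask with a filled rectangle."""
    m = np.zeros(shape, dtype=bool)
    m[r0:r1, c0:c1] = True
    return m


if __name__ == "__main__":
    original = [box_mask((4, 4), 0, 2, 0, 2)]
    pseudo = [
        box_mask((4, 4), 0, 2, 0, 2),  # duplicates the original annotation
        box_mask((4, 4), 2, 4, 2, 4),  # new object, kept
    ]
    print(len(merge_annotations(original, pseudo)))
```

The merged list would then serve as the training set for the instance segmentation system; always keeping the original masks reflects that human annotations are treated as more reliable than generated ones in this sketch.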