The devil is in the object boundary: towards annotation-free instance segmentation using Foundation Models

Cheng Shi, Sibei Yang

TL;DR

The paper addresses the annotation bottleneck in instance segmentation by identifying that existing foundation models such as DINO and SAM fail to separate closely packed instances because of boundary ambiguity. It proposes Zip, a classification-first-then-discovery pipeline that uses CLIP-derived dense semantic clues and a boundary prior from a specific intermediate CLIP layer to guide clustering, fragment selection, and SAM-based mask refinement for annotation-free, open-vocabulary detection and segmentation. The approach yields substantial zero-shot gains on COCO, enables open-vocabulary detection competitive with supervised baselines after self-training, and supports data-efficient tuning with limited labels. Overall, Zip offers a practical, scalable path to annotation-free instance segmentation by coupling CLIP boundary discovery with SAM segmentation in a multi-stage framework.

Abstract

Foundation models, pre-trained on a large amount of data, have demonstrated impressive zero-shot capabilities in various downstream tasks. However, in object detection and instance segmentation, two fundamental computer vision tasks heavily reliant on extensive human annotations, foundation models such as SAM and DINO struggle to achieve satisfactory performance. In this study, we reveal that the devil is in the object boundary, i.e., these foundation models fail to discern boundaries between individual objects. For the first time, we find that CLIP, which has never accessed any instance-level annotations, can provide a highly beneficial and strong instance-level boundary prior in the clustering results of a particular intermediate layer. Following this surprising observation, we propose Zip, which Zips up CLIP and SAM in a novel classification-first-then-discovery pipeline, enabling annotation-free, complex-scene-capable, open-vocabulary object detection and instance segmentation. Zip significantly boosts SAM's mask AP on the COCO dataset by 12.5% and establishes state-of-the-art performance in various settings, including training-free, self-training, and label-efficient finetuning. Furthermore, annotation-free Zip even achieves performance comparable to the best-performing open-vocabulary object detectors that use base annotations. Code is released at https://github.com/ChengShiest/Zip-Your-CLIP

Paper Structure (36 sections, 7 equations, 14 figures, 10 tables)

Figures (14)

  • Figure 1: Comparison of different annotation-free object discovery methods and zero-shot SAM-H. Previous state-of-the-art methods rely on DINO, struggle to discern between instance-level objects, and often miss potential objects within the background. On the other hand, SAM generates valid segmentation masks but struggles to determine the confidence of a particular mask representing an object. In contrast, our approach, employing clustering outcomes of CLIP's intermediate features, can effectively outline boundaries between objects to differentiate them.
  • Figure 2: Clustering result and boundary comparison between different methods. In our clustering results from A to D, the boundaries between objects are successfully clustered into particular clusters marked in orange or gray, serving as strong priors for segmenting objects of the same category. In the middle, we showcase the edges extracted by various methods. DINO provides only prominent outer edges, while SAM segments all edges, but such edges cannot distinguish which ones are object boundaries of interest. The devil is in the boundary.
  • Figure 3: Zip: Multi-object Discovery with No Supervision. Zip follows a classification-first-then-discovery approach, consisting of four steps: 1) Classification first to obtain semantic clues provided by CLIP, where the semantic clues indicate the approximate activation regions of potential objects. 2) Clustering on CLIP's features at a specific intermediate layer to discover object boundaries with the aid of our semantic-aware initialization. The semantic-aware initialization leverages semantic activation to automatically initialize clustering centers and determine the number of clusters. 3) Localization of individual objects by regrouping dispersed clustered fragments that have the same semantics, all while adhering to the detected boundaries. 4) Prompting SAM for precise masks for each individual object.
  • Figure 4: Label-efficient Tuning on COCO. We initialize the weights derived from various self-supervised methods and then fine-tune them on varying proportions of labeled data.
  • Figure 5: The influence of different layers of CLIP and initialization methods for clustering results.
  • ...and 9 more figures
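
Steps 2 and 4 of the Figure 3 pipeline can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the weighting rule in `semantic_aware_init`, the `threshold` parameter, the extra center for unclaimed patches, and all function names are assumptions made for illustration. The sketch shows how per-patch class activations could both seed the cluster centers and fix the number of clusters, how plain k-means then runs from those seeds, and how a clustered fragment could be turned into a box of the kind a promptable segmenter accepts.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def _mean(vectors: Sequence[Sequence[float]], weights: Sequence[float]) -> Vector:
    """Weighted mean of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]


def semantic_aware_init(features: Sequence[Sequence[float]],
                        activation: Sequence[Sequence[float]],
                        threshold: float = 0.5) -> List[Vector]:
    """Assumed initialization: one center per activated class, plus one
    center for patches that no class claims.

    features:   N patch feature vectors.
    activation: N rows of per-class scores (hypothetical dense scores).
    The number of clusters follows from how many classes pass `threshold`.
    """
    centers: List[Vector] = []
    for c in range(len(activation[0])):
        # Only patches where class c is clearly active contribute.
        weights = [row[c] if row[c] > threshold else 0.0 for row in activation]
        if sum(weights) > 0:
            centers.append(_mean(features, weights))
    unclaimed = [f for f, row in zip(features, activation)
                 if max(row) <= threshold]
    if unclaimed:
        centers.append(_mean(unclaimed, [1.0] * len(unclaimed)))
    return centers


def kmeans(features: Sequence[Sequence[float]],
           centers: Sequence[Sequence[float]],
           n_iter: int = 10) -> Tuple[List[int], List[Vector]]:
    """Plain k-means (Lloyd's algorithm) starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(features)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(f, centers[k])))
            for f in features
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for k in range(len(centers)):
            members = [f for f, lab in zip(features, labels) if lab == k]
            if members:
                centers[k] = _mean(members, [1.0] * len(members))
    return labels, centers


def mask_to_box(mask: Sequence[Sequence[int]]) -> Tuple[int, int, int, int]:
    """Tight (x_min, y_min, x_max, y_max) box around the truthy mask cells."""
    coords = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask is empty")
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)
```

In this sketch, calling `kmeans(features, semantic_aware_init(features, activation))` needs no hand-picked cluster count, which mirrors the caption's point that semantic activation determines the number of clusters. Each resulting fragment mask can then be passed through `mask_to_box` to obtain a box prompt.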