
Segment Anything without Supervision

XuDong Wang, Jingfeng Yang, Trevor Darrell

TL;DR

UnSAM advances segmentation by removing the dependence on large-scale human annotations. It employs a divide-and-conquer strategy to generate rich, hierarchical pseudo-masks from unlabeled images, enabling both automatic whole-image and promptable segmentation without supervision. Through self-training and strategic fusion with SA-1B ground-truth, UnSAM achieves competitive zero-shot performance with 1% of SA-1B data and even surpasses fully supervised SAM in several settings (notably via UnSAM+). The results demonstrate the practical impact of unsupervised, multi-granular segmentation for open-world tasks and offer a scalable path toward less biased, more detailed scene understanding. The approach also provides a transferable framework for integrating unsupervised masks with supervised data to boost performance across diverse datasets.
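The fusion of unsupervised masks with SA-1B ground truth can be pictured with a small sketch. It assumes, purely for illustration, that a pseudo-mask is added whenever it overlaps no ground-truth mask strongly; the function names, the set-of-pixels mask representation, and the 0.5 IoU threshold are hypothetical and are not taken from this summary.

```python
def mask_iou(a, b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fuse_masks(gt_masks, pseudo_masks, iou_threshold=0.5):
    """Return ground-truth masks plus pseudo-masks that add new regions.

    Assumption (illustrative only): a pseudo-mask is kept when its best IoU
    with every ground-truth mask is below `iou_threshold`, i.e. it covers an
    entity the annotators did not label.
    """
    fused = list(gt_masks)
    for pseudo in pseudo_masks:
        best = max((mask_iou(pseudo, gt) for gt in gt_masks), default=0.0)
        if best < iou_threshold:
            fused.append(pseudo)
    return fused


# Usage: one labeled object, one duplicate pseudo-mask, one novel pseudo-mask.
gt = [{(0, 0), (0, 1)}]
pseudo = [{(0, 0), (0, 1)}, {(5, 5)}]
fused = fuse_masks(gt, pseudo)
```

Under this rule the duplicate pseudo-mask is dropped and only the novel one is appended, which is one way "entities overlooked by supervised SAM" could enter the training labels.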

Abstract

The Segment Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves competitive results with the supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo masks into SA-1B's ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM's AR by over 6.7% and AP by 3.9% on SA-1B.
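The bottom-up step described in the abstract can be sketched as greedy merging of neighboring patches inside one segment. This is a minimal illustration, not the authors' implementation: the cosine-similarity criterion, the running-mean group features, the function names, and the threshold values are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_hierarchy(features, neighbors, thresholds):
    """Greedy bottom-up merging of patches within one segment.

    features:   one feature vector per patch (hypothetical inputs).
    neighbors:  iterable of (i, j) index pairs for spatially adjacent patches.
    thresholds: similarity thresholds; each yields one level of the hierarchy.
    Returns {threshold: sorted list of patch-index groups}.
    """
    groups = {i: [i] for i in range(len(features))}
    means = {i: list(f) for i, f in enumerate(features)}
    adj = {i: set() for i in groups}
    for i, j in neighbors:
        adj[i].add(j)
        adj[j].add(i)

    levels = {}
    # Go from strict to loose thresholds so coarser levels nest finer ones.
    for t in sorted(thresholds, reverse=True):
        while True:
            best = None
            for i in adj:
                for j in adj[i]:
                    if i < j:
                        s = cosine(means[i], means[j])
                        if s >= t and (best is None or s > best[0]):
                            best = (s, i, j)
            if best is None:
                break  # no adjacent pair is similar enough at this level
            _, i, j = best
            ni, nj = len(groups[i]), len(groups[j])
            # Merge group j into group i; keep a size-weighted mean feature.
            means[i] = [(ni * a + nj * b) / (ni + nj)
                        for a, b in zip(means[i], means[j])]
            groups[i] += groups.pop(j)
            del means[j]
            for k in adj.pop(j):
                adj[k].discard(j)
                if k != i:
                    adj[k].add(i)
                    adj[i].add(k)
        levels[t] = sorted(sorted(g) for g in groups.values())
    return levels


# Usage: four patches in a chain; two look alike, the other two look alike.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
levels = merge_hierarchy(feats, [(0, 1), (1, 2), (2, 3)], [0.9, 0.05])
```

In this toy run the strict threshold yields two part-level groups and the loose threshold merges them into one, mirroring how masks at different thresholds form a hierarchy.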

Paper Structure

This paper contains 24 sections, 3 equations, 10 figures, 4 tables, and 1 algorithm.

Figures (10)

  • Figure 1: UnSAM significantly surpasses the performance of the previous SOTA methods in unsupervised segmentation, and delivers impressive whole-image and promptable segmentation results, rivaling the performance of the supervised SAM (Kirillov et al., 2023). This comparative analysis features our unsupervised UnSAM, the supervised SAM, and an enhanced version, UnSAM+, across a variety of datasets. The top section displays raw images (row 1) alongside whole-image segmentation outputs from UnSAM (row 2) and SAM (row 3). The bottom section highlights our promptable segmentation results using a point prompt (i.e., the star mark). The right panel quantitatively compares the performance across models, including metrics like Mask AR (%) and Point IoU.
  • Figure 2: Our divide-and-conquer pipeline generates the "ground-truth" pseudo masks used for training UnSAM without human supervision. It begins with a top-down clustering approach (i.e., the divide stage) that extracts initial semantic/instance-level masks using CutLER (Wang et al., 2023), which is based on Normalized Cuts (Shi and Malik, 2000). Subsequently, we refine these masks using a bottom-up clustering method (i.e., the conquer stage): within each mask, we iteratively merge semantically similar pixels into larger segments using various similarity thresholds. The resulting masks at different thresholds create a hierarchy. We zoom in on selected regions to visualize details.
  • Figure 3: Unsupervised pseudo-masks generated by our divide-and-conquer pipeline not only contain precise masks for coarse-grained instances (column 5), e.g., cameras and persons, but also capture fine-grained parts (column 3), e.g., digits and icons on a tiny camera monitor that are missed by the ground-truth labels of SA-1B (Kirillov et al., 2023).
  • Figure 4: UnSAM has competitive dense object segmentation results compared to the supervised SAM (Kirillov et al., 2023).
  • Figure 5: UnSAM not only discovers more fine-grained masks than the previous state-of-the-art unsupervised segmentation method (Cao et al., 2024), but also provides segmentation masks with a wide range of granularity. We show qualitative comparisons between UnSAM (with 3 levels of granularity) and baseline models on SA-1B (Kirillov et al., 2023).
  • ...and 5 more figures
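
The Mask AR and Point IoU metrics named in the Figure 1 caption can be illustrated with a simplified sketch. It assumes a COCO-style average recall over IoU thresholds from 0.50 to 0.95 with no cap on the number of predictions; the paper's exact evaluation protocol is not given in this section, and the function names and set-of-pixels mask representation are hypothetical.

```python
def mask_iou(a, b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels.

    For a point prompt, the IoU between the single predicted mask and the
    target mask is one plausible reading of "Point IoU".
    """
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def average_recall(gt_masks, pred_masks, thresholds=None):
    """Simplified average recall (AR) for class-agnostic masks.

    For each IoU threshold, recall is the fraction of ground-truth masks
    matched by at least one prediction; AR averages recall over thresholds.
    """
    if thresholds is None:
        # Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
        thresholds = [(50 + 5 * k) / 100 for k in range(10)]
    if not gt_masks:
        return 0.0
    best = [max((mask_iou(gt, p) for p in pred_masks), default=0.0)
            for gt in gt_masks]
    recalls = [sum(b >= t for b in best) / len(best) for t in thresholds]
    return sum(recalls) / len(recalls)


# Usage: one perfectly recovered mask and one recovered with IoU = 2/3.
gt = [{(0, 0), (0, 1)}, {(3, 0), (3, 1)}]
pred = [{(0, 0), (0, 1)}, {(3, 0), (3, 1), (3, 2)}]
ar = average_recall(gt, pred)
```

Because AR averages over increasingly strict IoU thresholds, it rewards masks that are tightly aligned with ground truth rather than merely overlapping it.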