
SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, Handong Zhao, Ani Nenkova, Liang-Yan Gui, Tong Sun, Yu-Xiong Wang

TL;DR

SOHES addresses open-world entity segmentation without human annotations by a three-phase self-supervised pipeline: self-exploration to generate high-quality pseudo-labels from self-supervised features, self-instruction to train a hierarchical segmentation model, and self-correction via teacher–student mutual learning to reduce noise. It additionally learns hierarchical relations among masks to represent entities and their constituent parts, producing multi-level forest structures. Built atop a DINO-based representation and Mask2Former, with an ancestor-prediction head, SOHES achieves state-of-the-art performance among self-supervised approaches and significantly narrows the gap to supervised SAM using only 2% of unlabeled SA-1B data. The approach demonstrates strong zero-shot generalization across COCO, LVIS, ADE20K, EntitySeg, and SA-1B, and enhances downstream backbone features for dense-prediction tasks, highlighting practical impact for open-world vision applications.

Abstract

Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels and rectify the noise in the pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES-ICLR.github.io.

Paper Structure (19 sections, 3 equations, 14 figures, 8 tables)


Figures (14)

  • Figure 1: SOHES boosts open-world entity segmentation with self-supervision on various image datasets. Compared to the prior state of the art, SOHES significantly reduces the gap between self-supervised methods and the supervised Segment Anything Model (SAM) (Kirillov et al., 2023), while using only 2% of SAM's training images, and those without labels.
  • Figure 2: Three phases of SOHES. In the first self-exploration phase, we cluster visual features from pre-trained DINO to generate initial pseudo-labels on unlabeled images. Then in the self-instruction phase, a segmentation model learns from the initial pseudo-labels. Finally, in the self-correction phase, we adopt a teacher-student framework to further refine the segmentation model.
  • Figure 3: Self-exploration phase for generating initial pseudo-labels. This phase consists of four steps. We first merge image patches into regions with high visual feature similarities, then zoom in on the small candidate regions and re-cluster the local images to better discover small entities. After that, we refine the mask details and identify the hierarchical structure among the masks.
  • Figure 4: Ancestor relation prediction in the self-instruction phase. The prediction target, a binary matrix of ancestor relations, is constructed from the hierarchical structure identified in the self-exploration phase. The ancestor prediction head uses two linear mappings $W_1,W_2$ to transform the query features $Q$ and learns to predict the target ancestors.
  • Figure 5: Teacher-student mutual-learning in the self-correction phase. We initialize both the teacher and student with the segmentation model learned in the self-instruction phase, which produces better segmentation predictions than the initial pseudo-labels. The student receives supervision from the teacher's pseudo-labels and the initial pseudo-labels. The teacher is updated as the exponential moving average (EMA) of the student.
  • ...and 9 more figures
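
Two mechanisms named in the captions for Figures 4 and 5 are small enough to illustrate directly: building the binary ancestor-relation target from a mask hierarchy, and updating the teacher as an exponential moving average (EMA) of the student. The following is a minimal sketch in plain Python, not the authors' implementation. Several details are assumptions made for illustration: the forest is given as a hypothetical `parent` list (with -1 marking a root), entry `[i][j]` of the target is taken to mean "mask j is an ancestor of mask i", the head scores pairs as sigmoid((Q W1)(Q W2)^T), and the momentum value is arbitrary. The captions only state that two linear mappings $W_1, W_2$ transform the query features $Q$ and that the teacher is the EMA of the student.

```python
import math


def ancestor_target(parent):
    """Build the binary ancestor matrix from a forest of masks.

    parent[i] is the index of mask i's parent, or -1 if mask i is a root
    (hypothetical encoding). target[i][j] == 1 iff mask j is an ancestor
    of mask i (assumed orientation).
    """
    n = len(parent)
    target = [[0] * n for _ in range(n)]
    for i in range(n):
        j = parent[i]
        while j != -1:  # walk upward until the root of this tree
            target[i][j] = 1
            j = parent[j]
    return target


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def ancestor_probs(Q, W1, W2):
    """Pairwise ancestor probabilities from query features Q.

    Assumed scoring rule: sigmoid((Q W1)(Q W2)^T).
    """
    left = matmul(Q, W1)
    right = matmul(Q, W2)
    right_t = [list(col) for col in zip(*right)]
    logits = matmul(left, right_t)
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]


def ema_update(teacher, student, momentum=0.99):
    """Return teacher <- momentum * teacher + (1 - momentum) * student.

    Parameters are stored as name -> float for simplicity; the momentum
    value is an illustrative choice.
    """
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}


# Example forest: mask 0 is a root, 1 is a part of 0, 2 is a part of 1,
# and mask 3 is a separate root.
example_target = ancestor_target([-1, 0, 1, -1])
```

In a training loop, the predicted probabilities would typically be compared against the target matrix with a binary cross-entropy loss, and `ema_update` would be called after each student optimization step; both choices are common practice rather than something the captions specify.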