Learning Accurate Segmentation Purely from Self-Supervision

Zuyao You, Zuxuan Wu, Yu-Gang Jiang

TL;DR

This work introduces Selfment, a fully self-supervised framework that segments foreground objects directly from raw images without human labels, pretrained segmentation models, or any post-processing. It also introduces Iterative Patch Optimization (IPO), a feature-space refinement procedure that progressively enforces spatial coherence and semantic consistency through iterative patch clustering.

Abstract

Accurately segmenting objects without any manual annotations remains one of the core challenges in computer vision. In this work, we introduce Selfment, a fully self-supervised framework that segments foreground objects directly from raw images without human labels, pretrained segmentation models, or any post-processing. Selfment first constructs patch-level affinity graphs from self-supervised features and applies NCut to obtain an initial coarse foreground-background separation. We then introduce Iterative Patch Optimization (IPO), a feature-space refinement procedure that progressively enforces spatial coherence and semantic consistency through iterative patch clustering. The refined masks are subsequently used as supervisory signals to train a lightweight segmentation head with contrastive and region-consistency objectives, allowing the model to learn stable and transferable object representations. Despite its simplicity and complete absence of manual supervision, Selfment sets new state-of-the-art (SoTA) results across multiple benchmarks. It achieves substantial improvements in $F_{\max}$ over previous unsupervised saliency detection methods on ECSSD ($+4.0\%$), HKU-IS ($+4.6\%$), and PASCAL-S ($+5.7\%$). Moreover, without any additional fine-tuning, Selfment demonstrates remarkable zero-shot generalization to camouflaged object detection tasks (e.g., $0.910$ $S_m$ on CHAMELEON and $0.792$ $F_\beta^\omega$ on CAMO), outperforming all existing unsupervised approaches and even rivaling the SoTA fully supervised methods.
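
The initial NCut step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ncut_bipartition`, the clipped cosine-similarity affinity, the `eps` offset, and the mean-threshold rule are all assumptions chosen for illustration, and the sketch does not cover IPO or the segmentation head. It solves the standard normalized-cut relaxation $(D - W)\,y = \lambda D\,y$ and splits patches using the eigenvector of the second-smallest eigenvalue.

```python
import numpy as np

def ncut_bipartition(features, eps=1e-5):
    """Split patches into two groups with a relaxed normalized cut.

    features: (N, D) array of patch embeddings (hypothetical input).
    Returns a boolean array of length N marking one side of the split.
    """
    # Cosine-similarity affinity, clipped to be non-negative (an assumption;
    # the source does not specify how the affinity graph is built).
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    W = np.clip(f @ f.T, 0.0, None) + eps

    # Degree of each node and D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)

    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, vecs = np.linalg.eigh(L_sym)

    # Map the second eigenvector back to the generalized problem
    # (D - W) y = lambda * D y, then threshold at its mean (an assumption).
    y = d_inv_sqrt * vecs[:, 1]
    return y > y.mean()


# Toy usage: 6 patches near one direction, 4 near an orthogonal one.
rng = np.random.default_rng(0)
toy = np.vstack([np.tile([1.0, 0.0], (6, 1)), np.tile([0.0, 1.0], (4, 1))])
toy = toy + 0.01 * rng.standard_normal(toy.shape)
mask = ncut_bipartition(toy)
```

The sign of an eigenvector is arbitrary, so the sketch only separates the patches into two groups; deciding which group is foreground needs an additional rule that is not shown here.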

Paper Structure (25 sections, 13 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: We propose Selfment, a fully self-supervised framework for foreground segmentation that generates highly detailed and accurate saliency maps without any human-annotated labels or post-processing.
  • Figure 2: An overview of Selfment. The input image is first encoded by a self-supervised backbone to produce dense patch features. These features define a patch-level affinity graph, from which we derive an initial foreground-background split using the eigenvector associated with the second-smallest eigenvalue of the NCut eigenproblem. We then apply Iterative Patch Optimization to improve spatial coherence and semantic consistency. Finally, the refined masks serve as supervisory signals for training a lightweight segmentation head.
  • Figure 3: Comparison with previous state-of-the-art methods on the unsupervised saliency detection task. All methods are evaluated without any post-processing at an inference resolution of $1280 \times 1280$.
  • Figure 4: Comparison with previous state-of-the-art on the camouflaged object detection tasks.
  • Figure 5: Comparison of segmentation performance among TokenCut [wang2022self], SelfMask [shin2022selfmask], FOUND [simeoni2023found], and Selfment using DINO-Base, DINOv3-Huge+, and DINOv3-7B as backbones. Metrics are reported on the ECSSD [shi2015hierarchical] dataset.
  • ...and 4 more figures