Self-Supervised Vision Transformers Are Efficient Segmentation Learners for Imperfect Labels

Seungho Lee, Seoungyoon Kang, Hyunjung Shim

TL;DR

The paper addresses the high annotation cost of semantic segmentation by leveraging a frozen self-supervised vision transformer backbone and training a lightweight segmentation head, thereby exploiting the backbone's shape priors to handle imperfect labels. The approach uses a simple linear head with a pixel-wise cross-entropy loss conditioned on imperfect masks, preserving the backbone features while enabling class predictions. Empirical results on PASCAL VOC 2012 show consistent improvements over state-of-the-art weakly supervised methods across scribble, point, image-level, and zero-shot vision-language labels, with notable gains such as 11.5 percentage points in zero-shot settings. The key contributions include a cost-effective strategy for weakly supervised segmentation, demonstrated robustness to label quality, and evidence that self-supervised vision transformers provide a strong backbone for this task, reducing training cost while maintaining performance.
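The masked pixel-wise cross-entropy mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the `(H, W, C)` logit layout, and the use of 255 as the marker for unlabeled pixels are all assumptions made for the example.

```python
import numpy as np

IGNORE = 255  # assumed marker for pixels that carry no label


def masked_pixel_cross_entropy(logits, target, ignore_index=IGNORE):
    """Mean cross-entropy over labeled pixels only.

    logits: (H, W, C) float array of per-pixel class scores.
    target: (H, W) int array of class ids; ignore_index marks unlabeled pixels.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    valid = target != ignore_index
    if not valid.any():
        return 0.0  # nothing is labeled, so there is nothing to match

    # Replace ignored ids with a harmless index before gathering.
    safe_target = np.where(valid, target, 0)
    picked = np.take_along_axis(log_probs, safe_target[..., None], axis=-1)[..., 0]

    # Unlabeled pixels are zeroed out and excluded from the average.
    return float(-(picked * valid).sum() / valid.sum())
```

When every pixel is labeled, the mask is all ones and the function reduces to ordinary pixel-wise cross-entropy. With sparse scribble or point labels, only the annotated pixels contribute, so the scores at unlabeled pixels do not affect the loss value.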

Abstract

This study demonstrates a cost-effective approach to semantic segmentation using self-supervised vision transformers (SSVT). By freezing the SSVT backbone and training a lightweight segmentation head, our approach effectively utilizes imperfect labels, thereby improving robustness to label imperfections. Empirical experiments show significant performance improvements over existing methods for various annotation types, including scribble, point-level, and image-level labels. The research highlights the effectiveness of self-supervised vision transformers in dealing with imperfect labels, providing a practical and efficient solution for semantic segmentation while reducing annotation costs. Through extensive experiments, we confirm that our method outperforms baseline models for all types of imperfect labels. In particular, with labels produced by a zero-shot vision-language model, our model achieves an 11.5%p performance gain over the baseline.
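To make the "frozen backbone, trainable head" idea concrete, here is a small sketch under stated assumptions. The backbone is represented only by a precomputed feature array that is never updated; the head is a single linear layer fit by plain gradient descent; and unlabeled tokens are marked with 255. The function names, learning rate, and step count are hypothetical and are not taken from the paper.

```python
import numpy as np


def train_linear_head(features, labels, num_classes, ignore_index=255,
                      lr=0.5, steps=200, seed=0):
    """Fit a linear classifier on frozen per-token features.

    features: (N, D) backbone outputs, treated as constants (never updated).
    labels:   (N,) int class ids; ignore_index marks unlabeled tokens.
    Returns the only trainable parameters: the head weights W and bias b.
    """
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)

    # Keep only tokens that actually carry a label.
    valid = labels != ignore_index
    x, y = features[valid], labels[valid]
    onehot = np.eye(num_classes)[y]

    for _ in range(steps):
        logits = x @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of mean cross-entropy with respect to the logits.
        grad = (probs - onehot) / len(y)
        W -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(features, W, b):
    """Assign each token the class with the highest head score."""
    return (features @ W + b).argmax(axis=1)
```

Because gradients are taken only with respect to `W` and `b`, the cost of each step depends on the head size rather than on the backbone, which is the source of the training-cost savings the abstract refers to.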

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Overview of our method. $\mathcal{L}$ represents the matching loss for each imperfect mask type (Equation 1). For image-level (class) labels, $\mathcal{L}$ is pixel-wise cross-entropy. For the other label types, $\mathcal{L}$ is masked pixel-wise cross-entropy. The backbone of the self-supervised vision transformer is frozen during semantic segmentation training. Only the segmentation head is trained on imperfect masks and their corresponding images.
  • Figure 2: DINOv2 feature analysis. In each image pair, the right image shows the result of applying K-means clustering to the DINOv2 tokens of the left image. Without any supervision, DINOv2 exhibits a strong shape prior: objects are identifiable from K-means clustering alone.
  • Figure 3: Qualitative evaluation on image-level labels.
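
The token clustering described for Figure 2 can be sketched as below. This is an illustrative NumPy version of standard Lloyd-style K-means applied to a grid of patch tokens. It does not load DINOv2; the `(H, W, D)` token grid is assumed to have been extracted beforehand, and the function names and iteration count are hypothetical.

```python
import numpy as np


def kmeans_tokens(tokens, k, iters=20, seed=0):
    """Lloyd-style K-means over (N, D) token features; returns (N,) cluster ids."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct tokens.
    centers = tokens[rng.choice(len(tokens), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    d2 = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def cluster_token_grid(token_grid, k):
    """Cluster an (H, W, D) grid of patch tokens into an (H, W) map of cluster ids."""
    h, w, d = token_grid.shape
    assign = kmeans_tokens(token_grid.reshape(h * w, d), k)
    return assign.reshape(h, w)
```

The resulting `(H, W)` map has one cluster id per patch and can be upsampled to image resolution for visual inspection. No labels are used at any point, which is why cluster boundaries that follow object outlines are read as evidence of a shape prior in the features.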