Self-Supervised Vision Transformers Are Efficient Segmentation Learners for Imperfect Labels
Seungho Lee, Seoungyoon Kang, Hyunjung Shim
TL;DR
The paper addresses the high annotation cost of semantic segmentation by leveraging a frozen self-supervised vision transformer backbone and training a lightweight segmentation head, thereby exploiting the backbone's shape priors to handle imperfect labels. The approach uses a simple linear head with a pixel-wise cross-entropy loss conditioned on imperfect masks, preserving the backbone features while enabling class predictions. Empirical results on PASCAL VOC 2012 show consistent improvements over state-of-the-art weakly supervised methods across scribble, point, image-level, and zero-shot vision-language labels, with notable gains such as 11.5 percentage points in zero-shot settings. The key contributions include a cost-effective strategy for weakly supervised segmentation, demonstrated robustness to label quality, and evidence that self-supervised vision transformers provide a strong backbone for this task, reducing training cost while maintaining performance.
Abstract
This study demonstrates a cost-effective approach to semantic segmentation using self-supervised vision transformers (SSVT). By freezing the SSVT backbone and training a lightweight segmentation head, our approach effectively utilizes imperfect labels, thereby improving robustness to label imperfections. Empirical experiments show significant performance improvements over existing methods for various annotation types, including scribble, point-level, and image-level labels. The research highlights the effectiveness of self-supervised vision transformers in dealing with imperfect labels, providing a practical and efficient solution for semantic segmentation while reducing annotation costs. Through extensive experiments, we confirm that our method outperforms baseline models for all types of imperfect labels. In particular, with labels derived from a zero-shot vision-language model, our model achieves a performance gain of 11.5 percentage points over the baseline.
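The frozen-backbone setup described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Random vectors stand in for the per-patch features of a frozen SSVT, and the only trained parameters are those of a linear head fitted with pixel-wise cross-entropy. The `IGNORE` value for unlabeled pixels, the toy data, and all function names and hyperparameters are assumptions made for illustration.

```python
import numpy as np

IGNORE = 255  # assumed marker for pixels without a label (e.g. outside a scribble)


def masked_cross_entropy(logits, labels, ignore=IGNORE):
    """Mean cross-entropy over labeled pixels, plus its gradient w.r.t. logits.

    logits: (N, C) float array, one row per pixel or patch.
    labels: (N,) int array; entries equal to `ignore` contribute nothing.
    """
    valid = labels != ignore
    idx = np.flatnonzero(valid)
    n_valid = max(idx.size, 1)  # avoid division by zero when nothing is labeled
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[idx, labels[idx]].sum() / n_valid
    grad = np.exp(log_probs)          # softmax probabilities
    grad[idx, labels[idx]] -= 1.0     # softmax minus one-hot on labeled pixels
    grad[~valid] = 0.0                # unlabeled pixels give no gradient
    return loss, grad / n_valid


def train_linear_head(features, labels, num_classes, lr=0.1, steps=300):
    """Fit only W and b by gradient descent; `features` is never modified."""
    _, dim = features.shape
    W = np.zeros((dim, num_classes))
    b = np.zeros(num_classes)
    losses = []
    for _ in range(steps):
        loss, grad = masked_cross_entropy(features @ W + b, labels)
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
        losses.append(loss)
    return W, b, losses


# Toy stand-in for frozen backbone features: three well-separated clusters.
rng = np.random.default_rng(0)
num_classes, dim, n = 3, 8, 600
centers = rng.standard_normal((num_classes, dim))
true_labels = rng.integers(0, num_classes, size=n)
features = centers[true_labels] + 0.3 * rng.standard_normal((n, dim))

# Imperfect (sparse) supervision: keep roughly 10% of the pixel labels.
keep = rng.random(n) < 0.1
sparse_labels = np.where(keep, true_labels, IGNORE)

W, b, losses = train_linear_head(features, sparse_labels, num_classes)
pred = (features @ W + b).argmax(axis=1)
accuracy = float((pred == true_labels).mean())
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, accuracy on all pixels {accuracy:.3f}")
```

In this sketch the features are treated as constants, which mirrors freezing the backbone: gradients reach only `W` and `b`. With a real backbone, `features` would be replaced by the per-patch embeddings it produces, and the head's predictions would be upsampled to the image resolution.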
