Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP
Daniil Pakhomov, Sanchit Hira, Narayani Wagle, Kemar E. Green, Nassir Navab
TL;DR
The paper introduces an unsupervised semantic segmentation framework that clusters features in the StyleGAN2 feature space to discover semantic regions that are consistent across images. Latent-space manipulation and CLIP-based labeling are added to handle rare classes. A synthetic dataset of generated images and their masks is then used to train a segmentation model via knowledge distillation, which lets the model generalize to real images. The approach is reported to achieve state-of-the-art results compared with semi-supervised methods on facial datasets and competitive cross-domain performance on OpenEDS. The work shows that combining generative modeling with vision-language priors can reduce labeling requirements while maintaining high-quality segmentation.
Abstract
We introduce a method that automatically segments images into semantically meaningful regions without human supervision. The derived regions are consistent across different images and coincide with human-defined semantic classes on some datasets. In cases where semantic regions are hard for humans to define and label consistently, our method is still able to find meaningful and consistent semantic classes. We use a pretrained StyleGAN2 generative model: clustering in the feature space of the generative model makes it possible to discover semantic classes. Once the classes are discovered, a synthetic dataset of generated images and corresponding segmentation masks can be created. A segmentation model is then trained on the synthetic dataset and is able to generalize to real images. Additionally, by using CLIP, we can use natural-language prompts to discover desired semantic classes. We test our method on publicly available datasets and show state-of-the-art results.
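
The clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes per-pixel generator features are already available as an (H, W, C) array (synthetic data stands in for StyleGAN2 activations here), it uses plain k-means with a deterministic farthest-point initialization, and the names `farthest_point_init` and `kmeans_pixels` are hypothetical.

```python
import numpy as np


def farthest_point_init(x, k):
    """Pick k initial centroids: the first point, then repeatedly the point
    farthest from all centroids chosen so far (deterministic)."""
    centroids = [x[0]]
    for _ in range(1, k):
        chosen = np.stack(centroids)
        # Squared distance from every point to its nearest chosen centroid.
        d = ((x[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(x[d.argmax()])
    return np.stack(centroids).astype(np.float64)


def kmeans_pixels(features, k, n_iters=20):
    """Cluster per-pixel feature vectors of shape (H, W, C).

    Returns an (H, W) integer mask of cluster ids and the (k, C) centroids.
    """
    h, w, c = features.shape
    x = features.reshape(-1, c).astype(np.float64)
    centroids = farthest_point_init(x, k)
    labels = np.zeros(len(x), dtype=np.int64)
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every pixel.
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels.reshape(h, w), centroids


if __name__ == "__main__":
    # Synthetic stand-in for generator features: two well-separated regions.
    rng = np.random.default_rng(0)
    feats = rng.normal(0.0, 0.1, size=(8, 8, 4))
    feats[:, 4:, :] += 10.0
    mask, _ = kmeans_pixels(feats, k=2)
    print(mask)
```

Because the centroids are computed in feature space rather than per image, the same centroids could in principle be reused to label pixels of other generated images, which is what would make the resulting masks consistent across a synthetic dataset.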
