Segmentation in Style: Unsupervised Semantic Image Segmentation with Stylegan and CLIP

Daniil Pakhomov, Sanchit Hira, Narayani Wagle, Kemar E. Green, Nassir Navab

TL;DR

The paper introduces an unsupervised semantic segmentation framework that leverages StyleGAN2 feature-space clustering to discover consistent semantic regions, augmented by latent-space manipulation and CLIP-based labeling to handle rare classes. A synthetic dataset with generated images and masks is used to train a segmentation model via knowledge distillation, enabling generalization to real images. The approach achieves state-of-the-art results against semi-supervised methods on facial datasets and demonstrates competitive, cross-domain performance on OpenEDS, illustrating practical viability of unsupervised segmentation with language-guided discovery. This work highlights the potential of combining generative modeling and vision-language priors to reduce labeling requirements while maintaining high-quality segmentation.

Abstract

We introduce a method that automatically segments images into semantically meaningful regions without human supervision. The derived regions are consistent across different images and coincide with human-defined semantic classes on some datasets. In cases where semantic regions are hard for humans to define and label consistently, our method still finds meaningful and consistent semantic classes. We use a pretrained StyleGAN2 generative model: clustering in the feature space of the generative model allows semantic classes to be discovered. Once classes are discovered, a synthetic dataset of generated images and corresponding segmentation masks can be created. A segmentation model is then trained on the synthetic dataset and is able to generalize to real images. Additionally, by using CLIP, we can use natural-language prompts to discover desired semantic classes. We test our method on publicly available datasets and show state-of-the-art results.
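
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes k-means as the clustering algorithm (the text only says "clustering"), uses tiny hand-written 2-D vectors in place of real StyleGAN2 feature maps, and all function names (`kmeans`, `label_feature_map`, and the helpers) are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, vector):
    """Index of the centroid closest to the given feature vector."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(centroids[k], vector))


def init_centroids(vectors, k):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, iterations=20):
    """Plain k-means over per-pixel feature vectors; returns the centroids."""
    centroids = init_centroids(vectors, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for idx, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[idx] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def label_feature_map(feature_map, centroids):
    """Assign every pixel of an H x W feature map to its nearest cluster."""
    return [[nearest(centroids, pixel) for pixel in row] for row in feature_map]


# Toy stand-in for pixel features collected from the first N generated images.
pixel_features = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                  [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
centroids = kmeans(pixel_features, k=2)

# Toy stand-in for the feature map of a newly generated image (2 x 2 pixels).
new_feature_map = [[[0.05, 0.05], [4.9, 5.2]],
                   [[5.2, 4.8], [0.2, -0.1]]]
print(label_feature_map(new_feature_map, centroids))  # [[0, 1], [1, 0]]
```

The centroids are fitted once on a small set of features and then reused to label pixels of further generated images, which mirrors the two-stage process the abstract outlines: discover classes first, then produce masks for a larger synthetic dataset.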

Paper Structure

This paper contains 9 sections, 1 equation, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Example of synthetic image and annotation pairs created with our method for hair segmentation (first and second row) and background segmentation (third row).
  • Figure 2: Synthetic dataset examples generated using our method with a StyleGAN model pretrained on the Flickr-Faces-HQ (FFHQ) dataset [karras2018style] (first row), the Animal Faces (AFHQ) dataset [choi2020stargan] (second and third rows, for cats and dogs), and a cartoon dataset [pony]. Semantic regions proposed by the network are consistent across samples even though there is no clear visual border between most semantic classes: human annotators usually find it hard to label examples like these consistently, while our method works well.
  • Figure 3: The process of creating a synthetic dataset with semantic annotations for our approach. StyleGAN consists of a mapping network and a generator network, which together allow synthetic images to be created. First, we generate $N$ images and save the intermediate feature maps produced by the generator network. Clustering these feature maps allows us to find semantic regions of the generated images. After the clusters are found, a much bigger set of images is generated and, using their feature maps, we assign each pixel to one of the previously discovered semantic clusters. A segmentation network is later trained on the synthetic dataset.
  • Figure 4: The process of creating synthetic annotations for rare classes that were not discovered during clustering. First, a latent direction in StyleGAN is learnt that adds a desired semantic class to almost every generated sample, using the method of [shen2020interfacegan, patashnik2021styleclip]. In this case, the vector $G$ represents the text prompt "a person with glasses". After that, almost every sample has the desired attribute, and it naturally appears as one of the clusters. As can be seen, the semantic region representing glasses is indeed captured by one of the clusters.
  • Figure 5: The process of classifying previously discovered clusters. Given a set of text prompts for the desired classes and a set of generated images with corresponding clusters, we embed both the text and the image regions using the pretrained text and image encoders of CLIP [radford2021learning]. We then compute pairwise dot products between text and cluster embeddings. Each cluster is assigned to the text prompt that yields the largest dot product. For example, in a given set of images, all clusters containing hair will be classified as "hair".
  • ...and 4 more figures
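
The cluster classification rule in the Figure 5 caption (assign each cluster to the prompt with the largest dot product) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the embeddings are hand-written 2-D toy vectors rather than real CLIP outputs, the vectors are L2-normalised before the dot product (an assumption of this sketch), and the function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length (assumed preprocessing step)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def assign_prompts(cluster_embeddings, prompt_embeddings, prompts):
    """Give each cluster the prompt whose embedding has the largest dot product."""
    unit_prompts = [normalize(p) for p in prompt_embeddings]
    assignments = []
    for cluster in cluster_embeddings:
        unit_cluster = normalize(cluster)
        scores = [dot(unit_cluster, p) for p in unit_prompts]
        best = max(range(len(prompts)), key=scores.__getitem__)
        assignments.append(prompts[best])
    return assignments


# Toy stand-ins: in practice these would come from CLIP's text and image encoders.
prompts = ["hair", "background"]
prompt_embeddings = [[1.0, 0.0], [0.0, 1.0]]
cluster_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]

print(assign_prompts(cluster_embeddings, prompt_embeddings, prompts))
# ['hair', 'background', 'hair']
```

Several clusters may map to the same prompt, which matches the caption's example in which all clusters containing hair are classified as "hair".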