Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals

Oliver Hahn, Nikita Araslanov, Simone Schaub-Meyer, Stefan Roth

TL;DR

This work presents PriMaPs (Principal Mask Proposals), which decompose images into semantically meaningful masks based on their feature representation. Fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM, then yields unsupervised semantic segmentation.

Abstract

Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global semantic categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across different datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines. Code is available at https://github.com/visinf/primaps.

Paper Structure (28 sections, 11 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: PriMaPs pseudo-label example. Principal mask proposals (PriMaPs) are iteratively extracted from an image (dashed arrows). Each mask is assigned a semantic class, resulting in a pseudo label. The examples are taken from the Cityscapes (top), COCO-Stuff (middle), and Potsdam-3 (bottom) datasets.
  • Figure 2: PriMaPs process. Given the dense feature embeddings $f$ of an image $I$, we compute the cosine-similarity map $M^1$ of all features $f$ to their first principal component's nearest neighbor feature. The first PriMaP $P^1$ is obtained by thresholding $M^1$. To obtain $P^2$, the features assigned to $P^1$ are masked out, and the process is repeated with the remaining features $f^2$. We repeat the PriMaPs process until the majority of features have been assigned to masks. Finally, all masks $P$ are upsampled and refined using a CRF.
  • Figure 3: (a) PriMaPs-EM architecture. An image $I$ and its augmented version $I'$ are embedded by the frozen self-supervised backbone $\mathcal{F}$, resulting in the dense features $f$ and $f'$. The segmentation prediction $y$ by the momentum class prototypes $\theta_M$ arises via the dot product with $f$. Likewise, $y'$ arises from the dot product of the running class prototypes $\theta_R$ with $f'$. Pseudo labels $P^\ast$ are constructed using PriMaPs, $I$, and $y$. We use the pseudo labels to optimize $\theta_R$, applying a focal loss. $\theta_R$ is gradually transferred to $\theta_M$ by means of an exponential moving average (EMA). (b) PriMaPs pseudo-label generation. Masks $P$ are proposed by iterative binary partitioning based on the cosine similarity of the features of any unassigned pixel to their first principal component’s nearest neighbor feature. Gray indicates these iterative steps. Next, the masks $P$ are aligned to the image $I$ using a CRF. Finally, a per-mask pseudo-class ID is assigned using majority voting based on the segmentation prediction $y$, resulting in the pseudo label $P^\ast$.
  • Figure 4: Qualitative results for the DINO ViT-B/8 baseline, PriMaPs-EM (Ours), STEGO (Hamilton et al., 2022), and STEGO+PriMaPs-EM (Ours) for Cityscapes, COCO-Stuff, and Potsdam-3. Our method produces locally more consistent segmentation results, reducing overall misclassification compared to the corresponding baseline.
  • Figure 5: Nearest neighbor anchoring of the principal direction in PriMaPs. Image, ground-truth label, and the first three similarity maps with respect to the principal direction (left) and their nearest neighbor (right) for all three datasets using DINO ViT-B/8. Anchoring localizes the signal for principal directions with high similarities to multiple visual concepts.
  • ...and 8 more figures
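
The procedure described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation (see the linked repository for that). The function names `primaps`, `pseudo_label`, and `ema_update` are hypothetical, as are the values of `threshold`, `min_unassigned`, `max_masks`, and `momentum`. Using the uncentered second-moment matrix to obtain the principal direction, and fixing its sign by the mean projection, are assumptions made for this sketch. Upsampling, CRF refinement, and the focal loss are omitted.

```python
import numpy as np


def primaps(features, threshold=0.35, min_unassigned=0.05, max_masks=20):
    """Iteratively propose masks from dense features of shape (H, W, C).

    Returns an (H, W) integer map: 0 = unassigned, k >= 1 = k-th mask.
    All default parameter values are illustrative assumptions.
    """
    h, w, c = features.shape
    f = features.reshape(-1, c).astype(np.float64)
    f_unit = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    masks = np.zeros(h * w, dtype=np.int64)

    for k in range(1, max_masks + 1):
        free = masks == 0
        # Stop once most features have been assigned to a mask.
        if free.mean() <= min_unassigned:
            break
        # First principal direction of the remaining features, taken as the
        # leading eigenvector of their second-moment matrix (assumption).
        fk = f[free]
        _, vecs = np.linalg.eigh(fk.T @ fk)
        pc = vecs[:, -1]
        # Eigenvectors have an arbitrary sign; orient pc toward the data.
        if (fk @ pc).sum() < 0:
            pc = -pc
        # Anchor: the remaining feature nearest to the principal direction.
        free_unit = f_unit[free]
        anchor = free_unit[np.argmax(free_unit @ pc)]
        # Cosine-similarity map to the anchor, thresholded into a mask.
        sim = f_unit @ anchor
        masks[free & (sim > threshold)] = k

    return masks.reshape(h, w)


def pseudo_label(masks, prediction, ignore_index=255):
    """Assign each mask the majority class of `prediction` inside it."""
    label = np.full(masks.shape, ignore_index, dtype=np.int64)
    for k in np.unique(masks):
        if k == 0:
            continue  # unassigned pixels keep ignore_index
        region = masks == k
        label[region] = np.bincount(prediction[region]).argmax()
    return label


def ema_update(theta_m, theta_r, momentum=0.98):
    """Move the momentum prototypes toward the running prototypes."""
    return momentum * theta_m + (1.0 - momentum) * theta_r
```

Because the anchor is always an actual feature with similarity 1 to itself, every iteration assigns at least one pixel, so the loop makes progress even when the principal direction lies between several visual concepts.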