DatUS^2: Data-driven Unsupervised Semantic Segmentation with Pre-trained Self-supervised Vision Transformer

Sonal Kumar, Arijit Sur, Rashmi Dutta Baruah

TL;DR

This work proposes DatUS^2, a novel data-driven framework that performs unsupervised dense semantic segmentation (DSS) as a downstream task and achieves competitive accuracy on the large-scale COCO dataset.

Abstract

New self-supervised training schemes continue to emerge, each taking a step closer to a universal foundation model. In this process, unsupervised downstream tasks are recognized as one way to validate the quality of the visual features learned with a self-supervised training scheme. However, unsupervised dense semantic segmentation has not been explored as a downstream task, even though it can both utilize and evaluate the semantic information introduced into patch-level feature representations during self-supervised training of a vision transformer. Therefore, this paper proposes a novel data-driven approach for unsupervised semantic segmentation (DatUS^2) as a downstream task. DatUS^2 generates semantically consistent and dense pseudo-annotated segmentation masks for an unlabeled image dataset without using any visual prior or synchronized data. We compare these pseudo-annotated segmentation masks with ground-truth masks to evaluate how well recent self-supervised training schemes learn shared semantic properties at the patch level and discriminative semantic properties at the segment level. Finally, we evaluate existing state-of-the-art self-supervised training schemes with the proposed downstream task, i.e., DatUS^2. The best version of DatUS^2 also outperforms the existing state-of-the-art method for unsupervised dense semantic segmentation, with 15.02% mIoU and 21.47% pixel accuracy on the SUIM dataset. It also achieves competitive accuracy on a large-scale and complex dataset, i.e., the COCO dataset.
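As a rough illustration of the two metrics quoted in the abstract, the sketch below computes pixel accuracy and mean intersection-over-union between a predicted mask and a ground-truth mask. It assumes that the pseudo-label IDs have already been mapped to ground-truth class IDs; how that mapping is done is not described in this excerpt. The function names `pixel_accuracy` and `mean_iou` are hypothetical and are not taken from the paper.

```python
import numpy as np


def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted class equals the ground-truth class."""
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    return float((pred == gt).mean())


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt.

    Assumption: class IDs in `pred` are already aligned with those in `gt`.
    Classes absent from both masks are skipped rather than counted as zero.
    """
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class does not occur in either mask
        intersection = np.logical_and(pred_c, gt_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else 0.0


# Toy 2x2 masks with two classes.
pred_mask = np.array([[0, 0], [1, 1]])
gt_mask = np.array([[0, 1], [1, 1]])
acc = pixel_accuracy(pred_mask, gt_mask)
miou = mean_iou(pred_mask, gt_mask, num_classes=2)
```

In this toy case three of four pixels agree, and the per-class IoU values are 1/2 and 2/3, so the two metrics can differ noticeably even on the same prediction.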

Paper Structure (21 sections, 3 equations, 9 figures, 7 tables, 1 algorithm)

Figures (9)

  • Figure 1: An overview of the proposed method, i.e., data-driven unsupervised semantic segmentation with pre-trained self-supervised vision transformer (DatUS$^{2}$). The first three steps of the proposed method, i.e., Extract Patch Embeddings, Construct Affinity Graph, and Discover Image Segments, operate on a single image at a time. After applying these three steps to each image in the dataset, the method proceeds with the remaining steps, i.e., Segment-wise Pseudo Labeling, Create Initial Pseudo-annotated Masks, and Pseudo-mask De-noising and Smoothing.
  • Figure 2: Constructing the affinity graph using the key feature set $K_{P}$ extracted from a pre-trained self-supervised vision transformer.
  • Figure 3: The process of discovering segments from an affinity graph with unsupervised graph clustering, i.e., the Louvain clustering algorithm, followed by a further decomposition step. The red box highlights the noisy sub-segment of the resulting segments.
  • Figure 4: The segment-wise pseudo-labeling process used to generate initial pseudo-annotated segmentation masks.
  • Figure 5: De-noising the initial pseudo-annotated segmentation masks from the second-to-last step using a deep learning-based segmentation model.
  • ...and 4 more figures
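
The per-image steps named in Figures 2 and 3 (building an affinity graph from patch key features, then grouping patches into segments) can be sketched roughly as follows. This is a minimal sketch under several assumptions that are not confirmed by this excerpt: cosine similarity as the affinity measure, a fixed `threshold` for keeping edges, and random-free toy vectors standing in for the key feature set $K_{P}$. To keep the example dependency-free, it groups patches by connected components, which is a deliberately simpler stand-in for the Louvain clustering algorithm that the captions mention. The names `build_affinity` and `connected_segments` are hypothetical.

```python
import numpy as np


def build_affinity(keys, threshold=0.5):
    """Build a weighted affinity matrix from patch key features.

    keys: (num_patches, dim) array, a stand-in for the key feature set K_P.
    Assumption: affinity is cosine similarity, kept only where it reaches
    `threshold`; the paper's actual edge rule may differ.
    """
    keys = np.asarray(keys, dtype=float)
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.clip(norms, 1e-12, None)  # avoid division by zero
    similarity = unit @ unit.T
    affinity = np.where(similarity >= threshold, similarity, 0.0)
    np.fill_diagonal(affinity, 0.0)  # no self-loops
    return affinity


def connected_segments(affinity):
    """Assign one segment label per patch via connected components.

    This replaces Louvain clustering with a much simpler grouping rule,
    purely for illustration.
    """
    num_patches = affinity.shape[0]
    labels = -np.ones(num_patches, dtype=int)
    current = 0
    for start in range(num_patches):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first traversal over positive-weight edges
            node = stack.pop()
            for neighbor in np.flatnonzero(affinity[node] > 0):
                if labels[neighbor] == -1:
                    labels[neighbor] = current
                    stack.append(int(neighbor))
        current += 1
    return labels


# Four toy "patches": two point roughly along x, two roughly along y.
toy_keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_affinity = build_affinity(toy_keys, threshold=0.5)
toy_labels = connected_segments(toy_affinity)
```

For a real image, the labels would be reshaped to the patch grid and upsampled to pixel resolution; a modularity-based method such as Louvain would typically split weakly connected regions that connected components would merge.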