Rethinking Alignment and Uniformity in Unsupervised Semantic Segmentation

Daoan Zhang, Chenming Li, Haoquan Li, Wenjian Huang, Lingyun Huang, Jianguo Zhang

TL;DR

This work tackles unsupervised image semantic segmentation (UISS) by analyzing the limitations of mutual-information (MI)-based supervision and proposing a robust framework called Semantic Attention Network (SAN). SAN introduces the Semantic Attention (SEAT) module to dynamically align pixel-wise embeddings with batch-wise semantic representations, leveraging a CNN-based pixel encoder and a Vision Transformer (ViT) semantic backbone. To combat representation collapse common in MI-based approaches, the authors employ an image reconstruction constraint and enforce orthogonality of semantic embeddings, ensuring both alignment and uniformity across pixel and semantic spaces. Empirical results on five challenging datasets show that SAN achieves state-of-the-art performance among unpretrained methods and competitive results with pretrained baselines, highlighting its robustness and effectiveness for unsupervised dense semantic segmentation.
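
The orthogonality constraint mentioned above can be illustrated with a small sketch. The exact loss used in the paper is not given in this summary, so the formulation below (mean squared off-diagonal cosine similarity between semantic embeddings) is an assumption, and the name `orthogonality_penalty` is hypothetical. The penalty is zero when the embeddings are mutually orthogonal and reaches one when they all collapse onto the same direction.

```python
import numpy as np

def orthogonality_penalty(semantic: np.ndarray) -> float:
    """Mean squared cosine similarity between distinct semantic embeddings.

    semantic: array of shape (K, D), one row per semantic embedding, K >= 2.
    NOTE: an assumed, commonly used formulation; not taken from the paper.
    """
    k = semantic.shape[0]
    if k < 2:
        raise ValueError("need at least two semantic embeddings")
    # L2-normalize each row so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(semantic, axis=1, keepdims=True)
    unit = semantic / np.clip(norms, 1e-12, None)
    gram = unit @ unit.T
    # Keep only the off-diagonal entries (similarities between distinct rows).
    off_diagonal = gram[~np.eye(k, dtype=bool)]
    return float(np.mean(off_diagonal ** 2))

# Orthogonal embeddings incur no penalty; identical embeddings incur the maximum.
orthogonal = np.eye(3, 4)
collapsed = np.ones((3, 4))
print(orthogonality_penalty(orthogonal), orthogonality_penalty(collapsed))
```

Minimizing a term of this kind pushes the semantic embeddings apart, which is one way to encourage uniformity and discourage the collapse described above.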

Abstract

Unsupervised image semantic segmentation (UISS) aims to match low-level visual features with semantic-level representations without external supervision. In this paper, we examine the critical properties of UISS models from the perspective of feature alignment and feature uniformity. We also compare UISS with image-wise representation learning. Based on this analysis, we argue that existing MI-based methods in UISS suffer from representation collapse. To address this, we propose a robust network called Semantic Attention Network (SAN), in which a new module, Semantic Attention (SEAT), generates pixel-wise and semantic features dynamically. Experimental results on multiple semantic segmentation benchmarks show that our unsupervised segmentation framework is effective at capturing semantic representations and outperforms all unpretrained methods and even several pretrained ones.

Paper Structure (16 sections, 9 equations, 4 figures, 5 tables)

This paper contains 16 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Paradigms of unsupervised semantic segmentation. Clustering methods (left) map the image into a latent space and use clustering to classify the semantic embeddings; contrastive methods (middle) compare the MI of different views from different images; SAN (right) maps both pixel-wise and semantic embeddings into a latent space, then matches them via the SEAT module. Specific supervision is applied to the output of SEAT.
  • Figure 2: The architecture of SAN. The pixel-wise encoder maps the image into a high-dimensional latent space and produces the pixel-wise embeddings to be clustered. The semantic-wise generator produces semantic embeddings, which serve as centers to align and group the pixel-wise features. After the SEAT module, the consistency sustainer maintains consistency between the pixel-wise features (from the pixel-wise encoder) and the generated pixel-wise features (reconstructed from SEAT), providing a constraint on model learning. Both $L_{Rec}$ and $L_{Con}$ are $L_2$ losses.
  • Figure 3: The network structures of SAN. (a) Per-pixel Perceptron; (b) Multi-head Generator; (c) Token Matcher. Legend: (Conv, $K\times K$, $D$, $S$, $P$): convolution with filter size $K\times K$, padding $D$, stride $S$, and channel $P$; (MLP, M, N): MLP with input dimension M and output dimension N; (BN): Batch Normalization; (GELU): Gaussian Error Linear Units; (Sigmoid): sigmoid function.
  • Figure 4: Qualitative comparison of SAN. Left: results on COCO-Stuff-3, where we compare our method with IIC (Ji et al., 2019), which uses clustering. Our results are even more fine-grained than the ground-truth annotations (row 2). Right: results on the harder COCO-Stuff-27, where we compare our results with PiCIE (Cho et al., 2021), a strong method based on a pretrained model.
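
The matching step described in the captions for Figures 1 and 2 can be sketched roughly as follows. The internals of the SEAT module are not specified in this summary, so this example substitutes plain scaled dot-product attention as a stand-in. The function names `match_pixels_to_semantics` and `l2_consistency` and all shapes are assumptions made for illustration. Pixel embeddings attend over semantic embeddings, the attention weights give a soft assignment, and an $L_2$ term compares the original pixel features with those reconstructed from the semantic centers.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def match_pixels_to_semantics(pixels: np.ndarray, semantics: np.ndarray):
    """Soft-assign pixel embeddings (N, D) to semantic embeddings (K, D).

    NOTE: scaled dot-product attention is used as a stand-in for SEAT;
    it is not claimed to be the module's actual design.
    """
    scores = pixels @ semantics.T / np.sqrt(pixels.shape[1])  # (N, K)
    weights = softmax(scores, axis=1)       # soft assignment per pixel
    reconstructed = weights @ semantics     # (N, D) pixel features rebuilt from centers
    labels = weights.argmax(axis=1)         # hard segment label per pixel
    return weights, reconstructed, labels

def l2_consistency(original: np.ndarray, generated: np.ndarray) -> float:
    """Mean squared difference between original and generated pixel features."""
    return float(np.mean((original - generated) ** 2))

rng = np.random.default_rng(0)
pixels = rng.normal(size=(6, 4))       # 6 pixels, 4-dimensional embeddings
semantics = rng.normal(size=(3, 4))    # 3 semantic centers
weights, reconstructed, labels = match_pixels_to_semantics(pixels, semantics)
loss = l2_consistency(pixels, reconstructed)
```

In this sketch the consistency term is small only when each pixel feature lies close to a combination of semantic centers, which loosely mirrors the role the caption assigns to the consistency sustainer.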