
SimSAM: Simple Siamese Representations Based Semantic Affinity Matrix for Unsupervised Image Segmentation

Chanda Grover Kamra, Indra Deep Mastan, Nitin Kumar, Debayan Gupta

TL;DR

SimSAM tackles unsupervised image segmentation by learning a semantically meaningful dense feature affinity from pre-trained DINO-ViT features using a non-contrastive Siamese framework. It predicts a semantic affinity matrix $W_{SA}$ through a simple projector $\psi$ and predictor $\pi$ under a stop-gradient loss, and combines it with a vanilla affinity $W_A$ to form $W_{feat} = W_A + \kappa W_{SA}$ for spectral segmentation. Across object and semantic segmentation benchmarks, SimSAM yields consistent improvements over deep spectral baselines, with ablations identifying a lightweight configuration (one non-linear projector and one linear predictor) as most effective. The approach leverages existing self-supervised representations to produce semantically coherent segmentations with competitive performance and practical applicability to unsupervised segmentation tasks.
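The combination $W_{feat} = W_A + \kappa W_{SA}$ and the spectral step that follows can be illustrated with a small NumPy sketch. Only the combination formula comes from the summary above. Several things are assumptions made for illustration and may differ from the authors' code: the clipped cosine-similarity form of $W_A$, the unnormalized Laplacian, the sign-based bipartition, the toy features, and the function names `vanilla_affinity`, `combine_affinities`, and `fiedler_bipartition`.

```python
import numpy as np

def vanilla_affinity(feats):
    """Cosine-similarity affinity with negatives clipped to zero.
    This form of W_A is an assumption, not taken from the paper."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return np.clip(f @ f.T, 0.0, None)

def combine_affinities(w_a, w_sa, kappa):
    """W_feat = W_A + kappa * W_SA, as stated in the summary."""
    return w_a + kappa * w_sa

def fiedler_bipartition(w_feat):
    """Split nodes by the sign of the second-smallest eigenvector
    of the unnormalized graph Laplacian L = D - W (assumed variant)."""
    w = 0.5 * (w_feat + w_feat.T)          # enforce symmetry
    lap = np.diag(w.sum(axis=1)) - w
    _, eigvecs = np.linalg.eigh(lap)       # eigenvalues in ascending order
    return eigvecs[:, 1] > 0

# Toy "dense features": two loose groups of three patches each (hypothetical data).
feats = np.array([[1.0, 0.1], [0.9, 0.0], [1.0, 0.2],
                  [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]])
w_a = vanilla_affinity(feats)
w_sa = w_a.copy()  # stand-in; in SimSAM this would come from the trained projector/predictor
w_feat = combine_affinities(w_a, w_sa, kappa=0.5)
mask = fiedler_bipartition(w_feat)
```

On this toy input the first three patches fall on one side of the partition and the last three on the other. Which side is labelled `True` depends on the arbitrary sign of the eigenvector.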

Abstract

Recent developments in self-supervised learning (SSL) have made it possible to learn data representations without the need for annotations. Inspired by the non-contrastive SSL approach (SimSiam), we introduce a novel framework, SimSAM, to compute the Semantic Affinity Matrix, which is significant for unsupervised image segmentation. Given an image, SimSAM first extracts features using pre-trained DINO-ViT, then projects the features to predict the correlations of dense features in a non-contrastive way. We show applications of the Semantic Affinity Matrix in object segmentation and semantic segmentation tasks. Our code is available at https://github.com/chandagrover/SimSAM.

Paper Structure (10 sections, 10 equations, 14 figures, 7 tables)


Figures (14)

  • Figure 1: Top row shows the qualitative results of object segmentation (a-e), and the bottom row presents the results of semantic segmentation (f-i).
  • Figure 2: The left side illustrates the SimSAM framework. The right side provides an overview of the computation of the semantic affinity matrix $W_{SA}$. First, we extract features using DINO-ViT. Then, two views ($\alpha_i$ and $\beta_i$) are obtained with Random Affine transformations. These views are processed by a projector network $\psi$ (consisting of a non-linear layer), followed by a predictor $\pi$ (consisting of a linear layer), and the loss $\mathcal{L}$ is minimized to train the projector and predictor (with stop-grad). Finally, $W_{SA}$ is computed for spectral segmentation.
  • Figure 3: Segmentation masks obtained with our method and baseline methods (Deep Cut [Aflalo_2023_ICCV] and Deep Spectral Methods (DSM) [melas2022deep]). Ground-truth masks are given at the top right of each input image.
  • Figure 4: Eigenvectors (EV) of an input image (top row, left side of Fig. \ref{fig:masks}). The second column shows the segmentation mask, and the remaining three show the top three eigenvectors.
  • Figure 5: Semantic segmentation masks obtained with our method and the baseline method (DSM [melas2022deep]).
  • ...and 9 more figures
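
The training step described in the Figure 2 caption (two views, projector $\psi$, predictor $\pi$, loss with stop-grad) can be sketched as follows. This is a minimal sketch that assumes the loss takes SimSiam's symmetric negative-cosine form, which the caption does not spell out. The `tanh` non-linearity, the layer sizes, the random weights, and the names `neg_cosine` and `siamese_loss` are illustrative assumptions, and the noisy copies below only stand in for the Random Affine views. NumPy has no autograd, so the stop-gradient is marked by a comment only.

```python
import numpy as np

def neg_cosine(p, z):
    """Mean negative cosine similarity between matching rows of p and z."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))

def siamese_loss(alpha, beta, projector, predictor):
    """Symmetric SimSiam-style loss over two views (assumed form)."""
    z_a, z_b = projector(alpha), projector(beta)
    p_a, p_b = predictor(z_a), predictor(z_b)
    # In an autograd framework, z_a and z_b would be detached here (stop-grad),
    # so gradients flow only through the predictor branch.
    return 0.5 * neg_cosine(p_a, z_b) + 0.5 * neg_cosine(p_b, z_a)

rng = np.random.default_rng(0)
w_proj = rng.normal(size=(8, 4))   # hypothetical projector weights
w_pred = rng.normal(size=(4, 4))   # hypothetical predictor weights

projector = lambda x: np.tanh(x @ w_proj)   # one non-linear layer (tanh assumed)
predictor = lambda z: z @ w_pred            # one linear layer

feats = rng.normal(size=(16, 8))                          # stand-in for DINO-ViT patch features
alpha = feats + 0.05 * rng.normal(size=feats.shape)       # stand-in view 1
beta = feats + 0.05 * rng.normal(size=feats.shape)        # stand-in view 2

loss = siamese_loss(alpha, beta, projector, predictor)
```

The loss is bounded between -1 and 1. As a sanity check, passing the same view twice with an identity predictor gives -1 up to floating-point error, the minimum of the objective.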