SimSAM: Simple Siamese Representations Based Semantic Affinity Matrix for Unsupervised Image Segmentation
Chanda Grover Kamra, Indra Deep Mastan, Nitin Kumar, Debayan Gupta
TL;DR
SimSAM tackles unsupervised image segmentation by learning a semantically meaningful dense feature affinity from pre-trained DINO-ViT features using a non-contrastive Siamese framework. It predicts a semantic affinity matrix $W_{SA}$ through a simple projector $\psi$ and predictor $\pi$ under a stop-gradient loss, and combines it with a vanilla affinity $W_A$ to form $W_{feat} = W_A + \kappa W_{SA}$ for spectral segmentation. Across object and semantic segmentation benchmarks, SimSAM yields consistent improvements over deep spectral baselines, with ablations identifying a lightweight configuration (one non-linear projector and one linear predictor) as most effective. Because it builds on existing self-supervised representations, the approach produces semantically coherent segmentations without requiring annotations.
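As a minimal sketch of the combination step, the NumPy code below forms $W_{feat} = W_A + \kappa W_{SA}$ and bipartitions the patches by the sign of the Fiedler vector of the graph Laplacian. It assumes both affinities are cosine similarities between patch features with negative values clipped to zero, which this summary does not specify. The function names, the random stand-in features, and the value of `kappa` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_affinity(feats):
    """Pairwise cosine similarity between rows, with negatives clipped to 0."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return np.clip(normed @ normed.T, 0.0, None)

def combine_affinities(w_a, w_sa, kappa):
    """W_feat = W_A + kappa * W_SA."""
    return w_a + kappa * w_sa

def fiedler_bipartition(w):
    """Split nodes by the sign of the second-smallest Laplacian eigenvector."""
    laplacian = np.diag(w.sum(axis=1)) - w
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    return eigvecs[:, 1] > 0

rng = np.random.default_rng(0)
n_patches, dim = 16, 8
dino_feats = rng.normal(size=(n_patches, dim))       # stand-in for DINO-ViT patch features
projected_feats = rng.normal(size=(n_patches, dim))  # stand-in for learned projected features

w_a = cosine_affinity(dino_feats)                    # vanilla affinity (assumed form)
w_sa = cosine_affinity(projected_feats)              # semantic affinity (assumed form)
w_feat = combine_affinities(w_a, w_sa, kappa=0.5)    # kappa value is hypothetical
mask = fiedler_bipartition(w_feat)                   # boolean label per patch
```

In this sketch, $\kappa$ scales how much the learned affinity contributes relative to the vanilla one; setting `kappa=0` recovers the vanilla affinity alone.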
Abstract
Recent developments in self-supervised learning (SSL) have made it possible to learn data representations without annotations. Inspired by the non-contrastive SSL approach SimSiam, we introduce SimSAM, a novel framework for computing the Semantic Affinity Matrix, which plays a significant role in unsupervised image segmentation. Given an image, SimSAM first extracts features using a pre-trained DINO-ViT, then projects these features to predict the correlations between dense features in a non-contrastive way. We show applications of the Semantic Affinity Matrix in object segmentation and semantic segmentation tasks. Our code is available at https://github.com/chandagrover/SimSAM.
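The non-contrastive objective can be illustrated with the symmetric negative cosine similarity used by SimSiam. The sketch below is only an assumption about how such a loss might be evaluated on dense patch features: the one-layer ReLU projector, the linear predictor, the random weights, and the noise-perturbed second view are hypothetical stand-ins. NumPy computes the loss value only, so the stop-gradient is indicated in comments rather than enforced.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero for all-zero rows

def neg_cosine(p, z):
    """Mean negative cosine similarity between matching rows of p and z.
    During training, z would be held constant (stop-gradient)."""
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + EPS)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + EPS)
    return -np.mean(np.sum(p * z, axis=1))

def simsiam_loss(x1, x2, w_proj, w_pred):
    """Symmetric SimSiam-style loss for two views of the same patch features."""
    z1 = np.maximum(x1 @ w_proj, 0.0)  # non-linear projector (hypothetical form)
    z2 = np.maximum(x2 @ w_proj, 0.0)
    p1 = z1 @ w_pred                   # linear predictor
    p2 = z2 @ w_pred
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

rng = np.random.default_rng(0)
n_patches, dim = 16, 8
view1 = rng.normal(size=(n_patches, dim))                 # stand-in for dense features
view2 = view1 + 0.1 * rng.normal(size=(n_patches, dim))   # hypothetical second view
w_proj = rng.normal(size=(dim, dim))
w_pred = rng.normal(size=(dim, dim))

loss = simsiam_loss(view1, view2, w_proj, w_pred)
```

The loss lies in $[-1, 1]$ and is minimized when each prediction aligns with the projected features of the other view. In an autodiff framework, the second argument of `neg_cosine` would be detached from the graph.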
