CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation
Max Curie, Paulo da Costa
TL;DR
CLASP addresses unsupervised per-image segmentation without labels or fine-tuning by marrying self-supervised DINO patch embeddings with adaptive spectral clustering. It constructs a cosine affinity graph from patch features, computes eigenvalues $\lambda_i$ and eigengaps $\delta_i=\lambda_i-\lambda_{i+1}$, and selects $K_{opt}=i^*+1$ via an eigengap–silhouette elbow, then performs a single-pass spectral partitioning with optional DenseCRF refinement. The method achieves competitive mIoU and pixel accuracy on COCO-Stuff and ADE20K, despite its training-free design, making it a strong, reproducible baseline for large unannotated image corpora in marketing, brand safety, and content moderation workflows. By avoiding dataset-level semantic labels and multi-stage training, CLASP offers a lightweight, plug-and-play approach to per-image segmentation with practical applicability in large-scale media pipelines.
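The eigengap part of the selection rule above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions not stated in the text: the function names `cosine_affinity` and `select_k_by_eigengap` are hypothetical, negative cosine similarities are clipped to zero, the eigenvalues $\lambda_i$ are those of the symmetrically normalized affinity sorted in descending order, and $i$ is indexed from 0 so that $K_{opt}=i^*+1$ counts the eigenvalues before the largest gap. The silhouette term and the DenseCRF refinement are omitted.

```python
import numpy as np


def cosine_affinity(features):
    """Build an (N, N) cosine affinity matrix from (N, D) patch features."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero vectors
    sim = unit @ unit.T
    # Assumption: negative similarities are clipped so all edge weights are >= 0.
    return np.clip(sim, 0.0, None)


def select_k_by_eigengap(affinity, k_max=10):
    """Pick K from the largest gap among the leading eigenvalues.

    Assumption: eigenvalues come from D^{-1/2} W D^{-1/2}, sorted descending,
    with 0-based index i, so K = i* + 1.
    """
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    norm_aff = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(norm_aff)[::-1]
    top = eigvals[: k_max + 1]
    gaps = top[:-1] - top[1:]  # delta_i = lambda_i - lambda_{i+1}
    i_star = int(np.argmax(gaps))
    return i_star + 1, gaps
```

On synthetic features drawn around a few well-separated directions, the normalized affinity is close to block diagonal, so the largest gap should fall right after the eigenvalues associated with those blocks. Real DINO patch features are noisier, which is presumably why the method pairs the eigengap with a silhouette criterion.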
Abstract
We introduce CLASP (Clustering via Adaptive Spectral Processing), a lightweight framework for unsupervised image segmentation that operates without any labeled data or fine-tuning. CLASP first extracts per-patch features using a self-supervised ViT encoder (DINO); it then builds an affinity matrix and applies spectral clustering. To avoid manual tuning, we select the segment count automatically with an eigengap-silhouette search, and we sharpen the boundaries with a fully connected DenseCRF. Despite its simplicity and training-free nature, CLASP attains competitive mIoU and pixel accuracy on COCO-Stuff and ADE20K, matching recent unsupervised baselines. The zero-training design makes CLASP a strong, easily reproducible baseline for large unannotated corpora, which are especially common in digital advertising and marketing workflows such as brand safety screening, creative asset curation, and social media content moderation.
