CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation

Max Curie, Paulo da Costa

TL;DR

CLASP performs unsupervised per-image segmentation without labels or fine-tuning by combining self-supervised DINO patch embeddings with adaptive spectral clustering. It constructs a cosine affinity graph from patch features, computes eigenvalues $\lambda_i$ and eigengaps $\delta_i=\lambda_i-\lambda_{i+1}$, selects the segment count $K_{opt}=i^*+1$ via an eigengap–silhouette elbow, and then performs a single-pass spectral partitioning with optional DenseCRF refinement. Despite its training-free design, the method achieves competitive mIoU and pixel accuracy on COCO-Stuff and ADE20K, making it a strong, reproducible baseline for large unannotated image corpora in marketing, brand safety, and content moderation workflows. Because it needs neither dataset-level semantic labels nor multi-stage training, CLASP can be dropped into large-scale media pipelines as a lightweight per-image segmenter.

Abstract

We introduce CLASP (Clustering via Adaptive Spectral Processing), a lightweight framework for unsupervised image segmentation that operates without any labeled data or fine-tuning. CLASP first extracts per-patch features using a self-supervised ViT encoder (DINO); it then builds an affinity matrix and applies spectral clustering. To avoid manual tuning, we select the segment count automatically with an eigengap-silhouette search, and we sharpen the boundaries with a fully connected DenseCRF. Despite its simplicity and training-free nature, CLASP attains competitive mIoU and pixel accuracy on COCO-Stuff and ADE20K, matching recent unsupervised baselines. The zero-training design makes CLASP a strong, easily reproducible baseline for large unannotated corpora, which are especially common in digital advertising and marketing workflows such as brand safety screening, creative asset curation, and social media content moderation.
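The affinity and eigengap steps described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation. The function names `cosine_affinity` and `eigengaps` are hypothetical, the random `features` array merely stands in for DINO patch embeddings, and two choices are assumptions: negative similarities are clipped to zero, and the eigenvalues are taken directly from the affinity matrix in descending order.

```python
import numpy as np

def cosine_affinity(features):
    """Cosine-similarity affinity between patch features (one patch per row)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    affinity = unit @ unit.T
    # Assumption: negative similarities are zeroed so all edge weights are non-negative.
    return np.clip(affinity, 0.0, None)

def eigengaps(affinity):
    """Eigenvalues in descending order and gaps delta_i = lambda_i - lambda_{i+1}."""
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigenvalues = np.linalg.eigvalsh(affinity)[::-1]
    gaps = eigenvalues[:-1] - eigenvalues[1:]
    return eigenvalues, gaps

# Toy stand-in for patch embeddings: 16 "patches" in two groups of 8.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
features = np.repeat(centers, 8, axis=0) + 0.05 * rng.standard_normal((16, 4))

affinity = cosine_affinity(features)
eigenvalues, gaps = eigengaps(affinity)
print(np.round(eigenvalues[:4], 3))
print(np.round(gaps[:4], 3))
```

For two well-separated groups, the leading eigenvalues should stand clearly apart from the rest, which is the structure an eigengap search looks for when choosing the segment count.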

Paper Structure

This paper contains 25 sections, 8 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Comparative analysis of image segmentation methods on the COCO-Stuff dataset. The first column shows the original images, while subsequent columns present segmentation outputs from U2Seg, Deep Spectral Segmentation, and our proposed method, CLASP, in both its Patch-based and Pixel-based variants. Despite its simplicity, CLASP produces structurally consistent and visually interpretable segmentations that can serve as fast, training-free region proposals — particularly useful in large-scale multimedia workflows where semantic alignment is not a requirement.
  • Figure 2: Illustrated workflow for semantic segmentation using CLASP. The process begins with image patch extraction, followed by feature encoding using a DINO (ViT) model. An affinity matrix is then constructed based on the extracted features, which undergoes eigenvector decomposition. The eigengap heuristic and silhouette analysis determine the number of clusters, guiding the spectral clustering process. The resulting patch-level segmentation is further refined using a DenseCRF, producing a final pixel-level segmentation output.
  • Figure 3: This figure illustrates the eigengap heuristic method. The blue dots represent the data points. The solid blue line, labeled $T$, connects the first and last points. From each data point, a green dashed line is drawn perpendicular to $T$, with length $d$. The red dot marks the “elbow” point, where $d$ is maximized.
  • Figure 4: Segmentation results on samples that are out-of-distribution with respect to DINO's training data. Each row shows the original image (left), followed by masks from U2Seg, Deep Spectral Segmentation, and our method CLASP (Patch and Pixel variants). In the top row, CLASP cleanly separates two adjacent animal instances, where other methods merge or miss them. In the second row, CLASP avoids the over-segmentation seen in Deep Spectral while preserving object boundaries. In the final row, although CLASP over-segments the dog, it produces more distinct background segmentation, offering clearer region separation overall. These examples illustrate CLASP’s ability to produce structurally coherent, training-free masks.
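
The elbow construction in the Figure 3 caption (a chord $T$ between the first and last points, and the point with the largest perpendicular distance $d$ from it) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code. The function name `elbow_index` and the sample `curve` are hypothetical, the x-coordinates are assumed to be the integer positions 0, 1, 2, ..., and no axis rescaling is applied even though rescaling can move the elbow.

```python
import numpy as np

def elbow_index(values):
    """Index of the point farthest (perpendicularly) from the chord T
    that joins the first and last points of the curve."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y), dtype=float)  # assumption: unit spacing on the x-axis
    chord_x, chord_y = x[-1] - x[0], y[-1] - y[0]
    chord_len = np.hypot(chord_x, chord_y)
    # Perpendicular distance = |2D cross product with the chord| / |chord|.
    d = np.abs((x - x[0]) * chord_y - (y - y[0]) * chord_x) / chord_len
    return int(np.argmax(d))

# Hypothetical decaying curve, e.g. a sorted spectrum or a gap sequence.
curve = [10.0, 4.0, 1.5, 1.0, 0.8, 0.7]
idx = elbow_index(curve)
print(idx)  # → 2
```

Here the returned index is zero-based. How it maps to the $K_{opt}=i^*+1$ rule depends on the indexing convention, which this page does not specify.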