Delving into Spectral Clustering with Vision-Language Representations

Bo Peng, Yuanwei Hu, Bo Liu, Ling Chen, Jie Lu, Zhen Fang

TL;DR

The paper tackles unsupervised image clustering by extending spectral clustering to multi-modal vision–language representations. It introduces Neural Tangent Kernel Spectral Clustering (NTK-SC), which anchors an NTK on a set of positive nouns to couple visual proximity with semantic overlap, producing a discriminative cross-modal affinity. A Regularized Affinity Diffusion (RAD) mechanism then adaptively ensembles multiple prompt-induced affinities for robustness across diverse datasets. Empirically, NTK-SC with RAD achieves state-of-the-art performance on 16 benchmarks, including challenging domain-shift and fine-grained datasets, and demonstrates strong visualization and ablation results, confirming the value of cross-modal semantics in clustering.

Abstract

Spectral clustering is a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper extends spectral clustering from a single-modal to a multi-modal regime. In particular, we propose Neural Tangent Kernel Spectral Clustering, which leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we formulate the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks (including classical, large-scale, fine-grained and domain-shifted datasets) show that our method consistently outperforms the state of the art by a large margin.
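To make the idea of an affinity that couples visual proximity with semantic overlap more concrete, the following is a minimal sketch on synthetic data. It is not the paper's NTK-based formulation: the function names (`coupled_affinity`, `spectral_embedding`, `kmeans`), the softmax temperature, and the elementwise product used as the coupling are all assumptions made for illustration, and random vectors stand in for image and noun embeddings from a vision-language model.

```python
import numpy as np

def coupled_affinity(img_feats, noun_feats, temperature=0.07):
    """Toy affinity: visual cosine similarity times overlap of noun distributions.
    (Illustrative stand-in, not the NTK affinity from the paper.)"""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = noun_feats / np.linalg.norm(noun_feats, axis=1, keepdims=True)
    visual = np.clip(img @ img.T, 0.0, None)        # visual proximity, kept non-negative
    logits = img @ txt.T / temperature              # image-to-noun similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    semantic = probs @ probs.T                      # overlap of noun distributions
    return visual * semantic

def spectral_embedding(affinity, k):
    """Row-normalised top-k eigenvectors of D^{-1/2} A D^{-1/2}."""
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    norm_aff = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(norm_aff)           # eigenvalues in ascending order
    top = eigvecs[:, -k:]
    return top / np.maximum(np.linalg.norm(top, axis=1, keepdims=True), 1e-12)

def kmeans(points, k, n_iter=20):
    """Small deterministic k-means: farthest-point init, then Lloyd iterations."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Synthetic stand-ins: two groups of "images" and two "positive nouns" per group.
rng = np.random.default_rng(0)
dim = 16
center_a, center_b = np.zeros(dim), np.zeros(dim)
center_a[0], center_b[1] = 3.0, 3.0
images = np.vstack([center_a + 0.3 * rng.standard_normal((10, dim)),
                    center_b + 0.3 * rng.standard_normal((10, dim))])
nouns = np.vstack([center_a + 0.3 * rng.standard_normal((2, dim)),
                   center_b + 0.3 * rng.standard_normal((2, dim))])

affinity = coupled_affinity(images, nouns)
labels = kmeans(spectral_embedding(affinity, k=2), k=2)
```

In this toy setting the semantic term is close to zero for image pairs that favour different nouns, so cross-group entries of `affinity` are damped while within-group entries stay large, which is the block-diagonal tendency the abstract describes.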

Paper Structure

This paper contains 27 sections, 5 theorems, 47 equations, 9 figures, 12 tables, and 1 algorithm.

Key Result

Lemma 1

Let $\boldsymbol{A}\in\mathbb{R}^{n\times n}$. The spectral radius of $\boldsymbol{A}$ is denoted by $\rho(\boldsymbol{A})=\max\{|\lambda| : \lambda\in\sigma(\boldsymbol{A})\}$, where $\sigma(\boldsymbol{A})$ is the spectrum of $\boldsymbol{A}$, i.e., the set of all its eigenvalues. Let $\Ver
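As a small numerical illustration of the spectral radius defined at the start of Lemma 1, the sketch below computes $\rho(\boldsymbol{A})$ as the largest eigenvalue magnitude. It covers only that definition; the helper name `spectral_radius` and the two example matrices are chosen here for illustration and do not come from the paper.

```python
import numpy as np

def spectral_radius(matrix):
    """Largest absolute value among the eigenvalues of a square matrix."""
    eigenvalues = np.linalg.eigvals(matrix)  # may be complex for non-symmetric input
    return float(np.max(np.abs(eigenvalues)))

# Scaled rotation: eigenvalues are +2i and -2i, so the spectral radius is 2.
rotation_scaled = np.array([[0.0, -2.0],
                            [2.0, 0.0]])

# Row-stochastic matrix: eigenvalues are 1 and 0.3, so the spectral radius is 1.
row_stochastic = np.array([[0.5, 0.5],
                           [0.2, 0.8]])

radius_rotation = spectral_radius(rotation_scaled)
radius_stochastic = spectral_radius(row_stochastic)
```

The first matrix shows why the modulus is needed: its eigenvalues are purely imaginary, yet its spectral radius is a positive real number.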

Figures (9)

  • Figure 1: Overview of the proposed NTK-based spectral clustering pipeline.
  • Figure 2: Visualization of affinity matrices on ImageNet-Dogs.
  • Figure 3: The objective value of Eq. (9) and the clustering performance (measured by NMI) at each optimization iteration on CIFAR-10 and DTD, respectively.
  • Figure 4: Ablation analysis of clustering performance by varying the value of $\tau$ on CIFAR-10 (left), DTD (middle) and UCF101 (right), respectively.
  • Figure 5: Ablation analysis of clustering performance by varying the value of $q$ on CIFAR-10 (left), DTD (middle) and UCF101 (right), respectively.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5