
Language-Assisted Image Clustering Guided by Discriminative Relational Signals and Adaptive Semantic Centers

Jun Ma, Xu Zhang, Zhengxing Jiao, Yaxin Hou, Hui Liu, Junhui Hou, Yuheng Jia

Abstract

Language-Assisted Image Clustering (LAIC) augments input images with additional texts generated with the help of vision-language models (VLMs) to improve clustering performance. Despite recent progress, existing LAIC methods often overlook two issues: (i) the textual features constructed for each image are highly similar, leading to weak inter-class discriminability; (ii) the clustering step is restricted to pre-built image-text alignments, limiting the potential for better utilization of the text modality. To address these issues, we propose a new LAIC framework with two complementary components. First, we exploit cross-modal relations to produce more discriminative self-supervision signals for clustering, as this is compatible with the training mechanisms of most VLMs. Second, we learn category-wise continuous semantic centers via prompt learning to produce the final clustering assignments. Extensive experiments on eight benchmark datasets demonstrate that our method achieves an average improvement of 2.6% over state-of-the-art methods, and the learned semantic centers exhibit strong interpretability. Code is available in the supplementary material.
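The "pre-built image-text alignment" mentioned above corresponds in spirit to zero-shot CLIP: each image is assigned to the class-name prompt (semantic center) it is most similar to. The following is a minimal illustrative sketch of that baseline only, assuming the Hugging Face transformers CLIP checkpoint openai/clip-vit-base-patch32 and hypothetical class names and image path; it is not the proposed method, which instead learns the semantic centers via prompt learning.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical class names standing in for the "semantic centers" that
# zero-shot CLIP derives from ground-truth labels.
class_names = ["dog", "airplane", "motorboat", "surfing"]
prompts = [f"a photo of a {c}" for c in class_names]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; softmax over the
# prompts gives per-class assignment scores as in Figure 1(b).
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
prediction = class_names[probs.argmax().item()]
print(dict(zip(class_names, probs.tolist())), "->", prediction)
```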

Paper Structure

This paper contains 37 sections, 14 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: (a) Analysis of phenomena in the first step of existing LAIC methods and our method on the ImageNet-10 dataset. (i) Joint PCA results show that the text features of nouns from the candidate set exhibit a more compact distribution, lying close to each other compared with the image features. (ii) Due to the phenomenon in (i), text features across different samples are overall highly similar, leading to weak inter-class discriminability. (iii) Rows of our learned image-text representation matrix serve as new representations, showing better inter-class discriminability while retaining intra-class consistency. (b) Classification scores of three images (from the DTD dataset) with respect to semantic centers obtained by different methods. Zero-shot CLIP relies on semantic centers built from ground-truth class names, which may not always capture accurate semantics (e.g., an image of a motorboat is incorrectly assigned to "Surfing"). Our learned semantic centers can be even better than class names, showing stronger discriminability.
  • Figure 2: Accuracy of each sample and its $\hat{k}$-nn ($\hat{k}=10$) belonging to the same ground-truth class in different spaces. We compare the results obtained by computing similarity on image features from CLIP and on rows of $\mathbf{C}$, showing that our image-text representation matrix has better neighbor consistency (a minimal sketch of this metric follows the figure list).
  • Figure 3: Visualization of image-text representation matrix $\mathbf{C}$.
  • Figure 4: K-means performance comparison of our method (Ours (no-train)) and two baseline methods (CLIP K-means and TAC (no-train)) across four datasets.
  • Figure 5: Accuracy gains achieved through neighborhood consistency filtering on seven datasets. Our strategy effectively mitigates clustering noise and ensures reliable supervision.
  • ...and 3 more figures
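
The neighbor-consistency measure in Figure 2 (the fraction of each sample's $\hat{k}$ nearest neighbors that share its ground-truth class) can be computed from any feature matrix, e.g., CLIP image features or rows of $\mathbf{C}$. A minimal sketch, assuming cosine similarity, scikit-learn's NearestNeighbors, and randomly generated placeholder features and labels:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_consistency(features: np.ndarray, labels: np.ndarray, k: int = 10) -> float:
    """Fraction of each sample's k nearest neighbors sharing its ground-truth label."""
    # Query k + 1 neighbors because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(features)
    _, idx = nn.kneighbors(features)
    neighbor_labels = labels[idx[:, 1:]]            # drop the self-neighbor
    return float((neighbor_labels == labels[:, None]).mean())

# Toy usage with random stand-ins for CLIP image features (or rows of C).
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 512))
labels = rng.integers(0, 10, size=100)
print(knn_label_consistency(feats, labels, k=10))
```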