
Image Clustering with External Guidance

Yunfan Li, Peng Hu, Dezhong Peng, Jiancheng Lv, Jianping Fan, Xi Peng

TL;DR

This work introduces Text-Aided Clustering (TAC), which uses external textual knowledge as a supervision signal for image clustering. TAC builds a discriminative text space by selecting the WordNet nouns that best distinguish the images, retrieves a text counterpart for each image, and then trains cluster heads on the image and text representations through cross-modal mutual distillation. The method reports state-of-the-art clustering results on eight benchmarks, including the full ImageNet-1K dataset, and surpasses zero-shot CLIP in many settings. TAC shows that external supervision, even in the form of a simple training-free text-driven baseline, can substantially improve clustering without relying on extensive internal supervision signals, offering a practical route for vision-language integration in unsupervised learning.

Abstract

The core of clustering is incorporating prior knowledge to construct supervision signals. From classic k-means based on data compactness to recent contrastive clustering guided by self-supervision, the evolution of clustering methods intrinsically corresponds to the progression of supervision signals. At present, substantial efforts have been devoted to mining internal supervision signals from data. Nevertheless, the abundant external knowledge such as semantic descriptions, which naturally conduces to clustering, is regrettably overlooked. In this work, we propose leveraging external knowledge as a new supervision signal to guide clustering, even though it seems irrelevant to the given data. To implement and validate our idea, we design an externally guided clustering method (Text-Aided Clustering, TAC), which leverages the textual semantics of WordNet to facilitate image clustering. Specifically, TAC first selects and retrieves WordNet nouns that best distinguish images to enhance the feature discriminability. Then, to improve image clustering performance, TAC collaborates text and image modalities by mutually distilling cross-modal neighborhood information. Experiments demonstrate that TAC achieves state-of-the-art performance on five widely used and three more challenging image clustering benchmarks, including the full ImageNet-1K dataset.
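
The select-and-retrieve step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the confidence-based noun selection, the softmax-weighted retrieval, and the `temperature` and `per_center` parameters are all assumptions made for illustration, and random vectors stand in for CLIP embeddings and for the image semantic centers.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def select_nouns(noun_emb, centers, per_center=2, temperature=0.01):
    """Assign every noun to its closest image semantic center and keep,
    for each center, the nouns with the highest assignment confidence.
    (Assumed selection rule; the source only says 'most discriminative'.)"""
    probs = softmax(noun_emb @ centers.T / temperature, axis=1)
    assign = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    keep = []
    for c in range(centers.shape[0]):
        idx = np.flatnonzero(assign == c)
        top = idx[np.argsort(-conf[idx])[:per_center]]
        keep.extend(top.tolist())
    return np.array(sorted(keep), dtype=int)

def text_counterparts(img_emb, noun_emb, temperature=0.01):
    """Retrieve nouns for each image as a similarity-weighted average of
    noun embeddings (assumed retrieval rule)."""
    weights = softmax(img_emb @ noun_emb.T / temperature, axis=1)
    return l2_normalize(weights @ noun_emb)

# Toy data: random unit vectors stand in for real image and noun embeddings.
rng = np.random.default_rng(0)
img_emb = l2_normalize(rng.normal(size=(12, 8)))
noun_emb = l2_normalize(rng.normal(size=(30, 8)))
centers = l2_normalize(rng.normal(size=(3, 8)))  # stand-in for semantic centers

selected = select_nouns(noun_emb, centers, per_center=2)
text_emb = text_counterparts(img_emb, noun_emb[selected])

# Training-free baseline: concatenate each image with its text counterpart.
joint = np.concatenate([img_emb, text_emb], axis=1)
print(joint.shape)
```

In this sketch, running k-means on `joint` would correspond to the training-free baseline; the clustering step itself is omitted to keep the example dependency-free.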

Paper Structure

This paper contains 26 sections, 11 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: The evolution of clustering methods could be roughly divided into three eras, including i) classic clustering, which designs clustering strategies based on data distribution assumptions; ii) deep clustering, which extracts clustering-favorable features with deep neural networks, and iii) self-supervised clustering, which constructs supervision signals through data augmentations or momentum strategies. In this work, instead of mining the internal supervision, we propose exploring external knowledge to facilitate image clustering. We categorize such a novel paradigm as iv) externally guided clustering. By leveraging the semantics in the text modality, our TAC pushes the clustering accuracy to a new state-of-the-art.
  • Figure 2: Our observations with two image examples from the ImageNet-Dogs dataset as a showcase. For each example, we show the manually annotated class names and the nouns obtained by the proposed TAC, as well as the zero-shot classification probabilities. From the example, one could arrive at two observations, namely, i) visually similar samples could be better distinguished in the text modality, and ii) manually annotated class names are not always the best semantic description. As shown, zero-shot CLIP falsely classifies both images to the Blenheim Spaniel class (probably due to the word Spaniel), whereas the nouns obtained by our TAC successfully separate them. Such observations suggest a great opportunity to leverage the external knowledge (hidden in the text modality in this showcase) to facilitate image clustering.
  • Figure 3: Overview of the proposed TAC. (Left) TAC first classifies all nouns from WordNet to image semantic centers, and selects the most discriminative nouns to construct the text space. After that, TAC retrieves nouns for each image to compute its counterpart in the text space. By concatenating the image and retrieved text, we arrive at an extremely simple baseline without any additional training. (Right) To better collaborate the text and image modalities, TAC trains cluster heads by mutually distilling the neighborhood information. In brief, TAC encourages images to have consistent cluster assignments with the nearest neighbors of their counterparts in the text embedding space, and vice versa. Such a cross-modal mutual distillation strategy further boosts the clustering performance of TAC.
  • Figure 4: Visualization of features extracted by different methods on the ImageNet-Dogs training set, with the corresponding k-means clustering ARI annotated on the top. a) image embedding directly obtained from the CLIP image encoder; b) text counterparts constructed by TAC; c) concatenation of images and text counterparts; d) representation learned by TAC through cross-modal mutual distillation.
  • Figure 5: Analyses on three hyper-parameters in the proposed TAC. The first two hyper-parameters influence both TAC with and without training. The last hyper-parameter only influences the cross-modal mutual distillation process of TAC.
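
The neighborhood consistency idea in the Figure 3 caption (images should share cluster assignments with the nearest neighbors of their text counterparts, and vice versa) can be sketched as follows. This is a simplified stand-in, not the loss used in the paper: the negative-log-agreement objective, the function names, and the neighbor count `k` are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def nearest_neighbors(emb, k):
    """Indices of the k most similar rows (dot product), excluding self."""
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]

def cross_modal_consistency(p_a, p_b, nn_b):
    """Mean negative log agreement between the assignment of sample i in
    modality A and the assignments, in modality B, of the neighbors that
    sample i's counterpart has in B's embedding space."""
    agree = np.einsum("ik,ijk->ij", p_a, p_b[nn_b])  # shape (n, k)
    return float(-np.log(agree + 1e-12).mean())

def mutual_loss(p_img, p_txt, img_emb, txt_emb, k=3):
    """Symmetric objective: image-to-text plus text-to-image consistency."""
    nn_txt = nearest_neighbors(txt_emb, k)
    nn_img = nearest_neighbors(img_emb, k)
    return (cross_modal_consistency(p_img, p_txt, nn_txt)
            + cross_modal_consistency(p_txt, p_img, nn_img))

# Toy data: two well-separated groups of four samples in each modality.
rng = np.random.default_rng(0)
base = np.repeat(np.eye(2), 4, axis=0)
img_emb = base + 0.01 * rng.normal(size=base.shape)
txt_emb = base + 0.01 * rng.normal(size=base.shape)

labels = np.repeat([0, 1], 4)
p_onehot = np.eye(2)[labels]          # assignments that match the groups
p_uniform = np.full((8, 2), 0.5)      # uninformative assignments

loss_good = mutual_loss(p_onehot, p_onehot, img_emb, txt_emb)
loss_uniform = mutual_loss(p_uniform, p_uniform, img_emb, txt_emb)
print(loss_good < loss_uniform)
```

In a trainable version, `p_img` and `p_txt` would be the softmax outputs of the two cluster heads, and this quantity would be minimized by gradient descent; an objective of this form usually also needs a term that prevents all samples from collapsing into one cluster, which is left out here.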