Information Maximization Clustering via Multi-View Self-Labelling

Foivos Ntelemis, Yaochu Jin, Spencer A. Thomas

TL;DR

This work proposes a novel single-phase clustering method that simultaneously learns meaningful representations and assigns the corresponding annotations by maximizing the dependency between an integrated discrete representation and a discrete probability distribution.

Abstract

Image clustering is a particularly challenging computer vision task, which aims to generate annotations without human supervision. Recent advances focus on the use of self-supervised learning strategies in image clustering, by first learning valuable semantics and then clustering the image representations. These multiple-phase algorithms, however, increase the computational time, and their final performance depends on the first stage. By extending the self-supervised approach, we propose a novel single-phase clustering method that simultaneously learns meaningful representations and assigns the corresponding annotations. This is achieved by integrating a discrete representation into the self-supervised paradigm through a classifier net. Specifically, the proposed clustering objective employs mutual information and maximizes the dependency between the integrated discrete representation and a discrete probability distribution. The discrete probability distribution is derived through the self-supervised process by comparing the learnt latent representation with a set of trainable prototypes. To enhance the learning performance of the classifier, we jointly apply the mutual information across multi-crop views. Our empirical results show that the proposed framework outperforms state-of-the-art techniques, with average accuracies of 89.1% and 49.0% on the CIFAR-10 and CIFAR-100/20 datasets, respectively. Finally, the proposed method also demonstrates attractive robustness to parameter settings, making it readily applicable to other datasets.
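To make the mutual-information idea concrete, the following is a minimal sketch of one common way to estimate the mutual information between two sets of paired soft assignments over a batch. It is an illustration under stated assumptions, not the objective defined in the paper: the function name `mutual_information`, the inputs `p_rows` (standing in for classifier outputs) and `q_rows` (standing in for prototype-derived distributions), and the batch-averaged joint estimate are all hypothetical choices made for this example.

```python
import math

def mutual_information(p_rows, q_rows, eps=1e-12):
    """Estimate I(Y; Z) from paired soft assignments.

    p_rows[i] and q_rows[i] are probability vectors for the same sample i.
    Assumption: the joint distribution is estimated as the batch average of
    the outer products p_i q_i^T (a common estimator, not necessarily the
    one used in the paper).
    """
    n = len(p_rows)
    k_p, k_q = len(p_rows[0]), len(q_rows[0])

    # Batch-averaged joint distribution over (class, prototype) pairs.
    joint = [[0.0] * k_q for _ in range(k_p)]
    for p, q in zip(p_rows, q_rows):
        for a in range(k_p):
            for b in range(k_q):
                joint[a][b] += p[a] * q[b] / n

    # Marginals obtained by summing the joint over the other variable.
    marg_p = [sum(row) for row in joint]
    marg_q = [sum(joint[a][b] for a in range(k_p)) for b in range(k_q)]

    # I(Y; Z) = sum_ab P(a, b) * log(P(a, b) / (P(a) * P(b))), in nats.
    mi = 0.0
    for a in range(k_p):
        for b in range(k_q):
            p_ab = joint[a][b]
            if p_ab > eps:
                mi += p_ab * math.log(p_ab / (marg_p[a] * marg_q[b]))
    return mi
```

Under this estimator, two balanced and perfectly agreeing one-hot assignments over two clusters give log 2 nats, while a uniform classifier output gives zero, which is why maximizing the quantity encourages confident and balanced cluster assignments that agree with the second distribution.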

Paper Structure

This paper contains 24 sections, 11 equations, 8 figures, 6 tables, 2 algorithms.

Figures (8)

  • Figure 1: Diagram of the framework's structure for a training instance $x_i$ transformed twice through $T()$. Here $E_\psi$ denotes the encoder model and $\textbf{C}$ the comparable prototypes. $A_\omega$ indicates the introduced classification model implemented on top of the embedding output (diagram designed with PlotNeuralNet haris_iqbal_2018_2526396).
  • Figure 2: Scatter plot of pairwise JSD distances between two probability distributions $U^{(1)}$ and $U^{(2)}$ generated by a pre-trained encoder from the same 32 image instances, where different transformations are applied to each instance.
  • Figure 3: Confusion matrix of the predictions made by the proposed model against the ground truth on the CIFAR-10 validation set.
  • Figure 4: Visual activation heatmaps for accurate (positive, green frame) and inaccurate (negative, red frame) predictions made by our model on STL10.
  • Figure 5: Performance of IMC-SwAV on CIFAR-10 for a limited number of training samples. The x-axis shows the number of training samples per class (in thousands).
  • ...and 3 more figures
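
The pairwise JSD comparison mentioned in the Figure 2 caption can be sketched as follows. This is a hedged illustration only: the helper names `kl`, `jsd`, and `pairwise_jsd` are hypothetical, the inputs are assumed to be lists of probability vectors, and the code computes the Jensen-Shannon divergence in nats. Whether the paper plots the divergence or its square root (the Jensen-Shannon distance) is not stated in the caption.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pairwise_jsd(u1, u2):
    """Matrix of JSD values between every row of u1 and every row of u2."""
    return [[jsd(p, q) for q in u2] for p in u1]
```

With rows indexed by image instance, the diagonal of `pairwise_jsd(u1, u2)` compares two views of the same image, and the off-diagonal entries compare views of different images.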