SiamMM: A Mixture Model Perspective on Deep Unsupervised Learning

Xiaodong Wang, Jing Huang, Kevin J Liang

TL;DR

SiamMM reframes clustering-based self-supervised learning as a mixture-model problem and introduces a two-tier EM framework that jointly learns cluster parameters and representations without relying on negative samples. It employs a non-negative soft-assignment loss, a dynamic cluster-merging strategy, and consistent centroid updates to efficiently discover semantically meaningful centroids that align with unseen labels. The approach achieves state-of-the-art performance on ImageNet SSL benchmarks, with strong linear and transfer results and insightful clustering visualizations. This mixture-model perspective offers a principled, scalable alternative to contrastive methods and sheds light on cluster structure and data labeling quality in large-scale vision data.

Abstract

Recent studies have demonstrated the effectiveness of clustering-based approaches for self-supervised and unsupervised learning. However, the application of clustering is often heuristic, and the optimal methodology remains unclear. In this work, we establish connections between these unsupervised clustering methods and classical mixture models from statistics. Through this framework, we demonstrate significant enhancements to these clustering methods, leading to the development of a novel model named SiamMM. Our method attains state-of-the-art performance across various self-supervised learning benchmarks. Inspection of the learned clusters reveals a strong resemblance to unseen ground truth labels, uncovering potential instances of mislabeling.

Paper Structure

This paper contains 31 sections, 15 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: SiamMM model architecture. Assuming all the embeddings are $L^2$ normalized, we cast clustering in representation learning as a von Mises-Fisher mixture model (vMFMM). The optimization objective tends to minimize the distance between an embedding and its clustering centroid (or nearest centroids) without negative samples.
  • Figure 2: Starting from different initial values, the number of clusters converges almost to the true number of clusters in each dataset (left: ImageNet1k [deng2009imagenet]; right: ImageNet100 [imagenet100]).
  • Figure 3: Visualization of Merged 1000 Clusters. Left: Images grouped under the same cluster label, where each row corresponds to a true class label in ImageNet. Right: Images grouped under class label "eggnog", where each row corresponds to a predicted clustering label.
  • Figure 4: Illustration of the different negative sampling strategies. The dot and cross indicate the embedding of a data point and a cluster centroid, respectively; the blue arrow illustrates a positive pair, while the red arrow illustrates a negative pair. (Left) Sampling negative cluster centroids (\ref{eq:loss_nce1}), either stored in a memory bank (PCL) or drawn from the current batch (SwAV); (Right) sampling negative data points from outside a target cluster, introduced in (\ref{eq:loss_nce2}).
  • Figure 5: Training time per $100$ epochs for different numbers of cluster re-initializations and different numbers of clusters.
  • ...and 5 more figures
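
The caption of Figure 1 describes clustering over $L^2$-normalized embeddings as a von Mises-Fisher mixture model. As a rough illustration of what that means, the sketch below computes soft cluster assignments and updated centroids for such a mixture. It is not the authors' implementation: the function names, the shared concentration `kappa`, and the equal mixing weights are all assumptions made for this example, and the representation-learning loss, cluster merging, and centroid update schedule of SiamMM are not covered.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project rows of x onto the unit sphere."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def vmf_soft_assign(embeddings, centroids, kappa=10.0):
    """E-step sketch: posterior cluster probabilities under a vMF mixture.

    Assumes (for illustration only) one shared concentration `kappa` and
    equal mixing weights, so the vMF normalizing constant cancels and the
    posterior reduces to a softmax over scaled cosine similarities.
    """
    z = l2_normalize(embeddings)
    mu = l2_normalize(centroids)
    logits = kappa * (z @ mu.T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(logits)
    return resp / resp.sum(axis=1, keepdims=True)


def update_centroids(embeddings, resp):
    """M-step sketch: mean direction is the normalized weighted sum."""
    z = l2_normalize(embeddings)
    return l2_normalize(resp.T @ z)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 4))   # toy embeddings
    c = rng.normal(size=(3, 4))   # toy initial centroids
    r = vmf_soft_assign(x, c)
    print(r.shape, update_centroids(x, r).shape)
```

Alternating these two functions gives a plain EM loop for the mixture parameters; a larger `kappa` pushes the soft assignments toward hard nearest-centroid assignments.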