Contrastive Mean-Shift Learning for Generalized Category Discovery

Sua Choi, Dahyun Kang, Minsu Cho

TL;DR

This work introduces Contrastive Mean-Shift (CMS) learning for generalized category discovery (GCD), integrating a mean-shift step into contrastive representation learning to yield clustering-friendly embeddings without requiring the true number of classes ($K$). It trains a self-supervised encoder with a combined unsupervised CMS loss and a supervised loss, while iteratively estimating $K$ via agglomerative clustering on a validation set; final clustering applies multi-step mean shift before agglomerative clustering. CMS achieves state-of-the-art results on six public GCD benchmarks, including scenarios without access to the ground-truth $K$, and demonstrates robust $K$ estimation during training and effective transfer of knowledge from known to unknown classes. The approach highlights the value of non-parametric mean-shift within a learnable, contrastive framework for scalable and practical novel-class discovery.

Abstract

We address the problem of generalized category discovery (GCD) that aims to partition a partially labeled collection of images; only a small part of the collection is labeled and the total number of target classes is unknown. To address this generalized image clustering problem, we revisit the mean-shift algorithm, i.e., a classic, powerful technique for mode seeking, and incorporate it into a contrastive learning framework. The proposed method, dubbed Contrastive Mean-Shift (CMS) learning, trains an image encoder to produce representations with better clustering properties by an iterative process of mean shift and contrastive update. Experiments demonstrate that our method, both in settings with and without the total number of clusters being known, achieves state-of-the-art performance on six public GCD benchmarks without bells and whistles.

Paper Structure

This paper contains 29 sections, 10 equations, 5 figures, 17 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Contrastive Mean-Shift (CMS) learning. By integrating mean shift [fukunaga1975estimation, meanshift] into contrastive learning [zhuang2019local, chen2020simple], the proposed method learns an embedding space in which the mean-shifted embeddings of an image $x_{i}$ and its augmented view $x_{i}^{+}$ are drawn together, while those of distinct images $x_{i}$ and $x_{j}$ are pushed apart.
  • Figure 2: Contrastive Mean-Shift Learning. Given a collection of images, each initial image embedding ${\bm{v}}_i$ from an image encoder takes a single mean-shift step to become ${\bm{z}}_i$ by aggregating its $k$ nearest neighbors with a weight kernel $\varphi(\cdot)$. The encoder network is then updated by contrastive learning on the mean-shifted embeddings, which draws the mean-shifted embedding of image $x_{i}$ and that of its augmented image $x_{i}^{+}$ closer and pushes those of distinct images apart. See text for details.
  • Figure 3: Clustering accuracy over mean-shift iterations on CUB.
  • Figure 4: $k$NN-retrieved images for the initial embedding ${\bm{v}}$ and the mean-shifted embedding ${\bm{z}}$ on CUB-200-2011. Green denotes the correct class and red an incorrect class.
  • Figure 5: $t$-SNE [tsne] visualization on ImageNet100. Each color indicates a ground-truth class.
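
The training step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the weight kernel $\varphi(\cdot)$ is taken to be a constant weight `phi`, neighbors are found by cosine similarity within the same batch, the contrastive objective is an InfoNCE-style loss with a hypothetical `temperature`, and all function names are invented here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def mean_shift_step(embeddings, k=2, phi=0.5):
    """One mean-shift step per embedding.

    Assumption: z_i = normalize(v_i + phi * sum of the k nearest neighbors),
    i.e. the kernel is a constant weight `phi` on each neighbor.
    """
    shifted = []
    for i, v in enumerate(embeddings):
        # Rank all other embeddings by similarity to v_i and keep the top k.
        neighbors = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(v, embeddings[j]),
            reverse=True,
        )[:k]
        agg = list(v)
        for j in neighbors:
            agg = [a + phi * b for a, b in zip(agg, embeddings[j])]
        shifted.append(l2_normalize(agg))
    return shifted


def contrastive_loss(z, z_pos, temperature=0.3):
    """InfoNCE-style loss on mean-shifted embeddings (an assumed form).

    z[i] and z_pos[i] are the mean-shifted embeddings of an image and its
    augmented view; z[j] for j != i act as negatives.
    """
    total = 0.0
    for i in range(len(z)):
        pos = math.exp(cosine(z[i], z_pos[i]) / temperature)
        neg = sum(
            math.exp(cosine(z[i], z[j]) / temperature)
            for j in range(len(z))
            if j != i
        )
        total += -math.log(pos / (pos + neg))
    return total / len(z)
```

In an actual training loop the loss would be backpropagated through the encoder that produces the initial embeddings; the sketch only shows how the mean-shifted embeddings and the loss value could be computed for one batch.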