Scaling Up Deep Clustering Methods Beyond ImageNet-1K

Nikolas Adaloglou; Felix Michels; Kaspar Senft; Diana Petrusheva; Markus Kollmann

TL;DR

This work systematically expands clustering evaluation to large-scale, realistic data by constructing ImageNet21K-based benchmarks that isolate class imbalance, granularity, easy-to-classify subsets, and multi-label signals. It shows that the feature-based clustering methods TEMI and SCANv2 generally outperform $k$-means on these large-scale benchmarks, though the gains narrow as the dataset grows and becomes imbalanced or coarser-grained. The study also demonstrates that easy-to-classify and multi-label scenarios reveal substantial gaps in $k$-means performance and that non-primary cluster predictions can reflect meaningful, higher-level semantics. Collectively, the benchmarks and findings advocate for broader large-scale evaluation beyond ImageNet-1K to better assess clustering methods in real-world, hierarchical, and multi-label contexts.

Abstract

Deep image clustering methods are typically evaluated on small-scale balanced classification datasets, while feature-based $k$-means has been applied to proprietary billion-scale datasets. In this work, we explore the performance of feature-based deep clustering approaches on large-scale benchmarks whilst disentangling the impact of the following data-related factors: i) class imbalance, ii) class granularity, iii) easy-to-recognize classes, and iv) the ability to capture multiple classes. Consequently, we develop multiple new benchmarks based on ImageNet21K. Our experimental analysis reveals that feature-based $k$-means is often unfairly evaluated on balanced datasets. However, deep clustering methods outperform $k$-means across most large-scale benchmarks. Interestingly, $k$-means underperforms on easy-to-classify benchmarks by large margins. The performance gap, however, diminishes in the largest data regimes such as ImageNet21K. Finally, we find that non-primary cluster predictions capture meaningful classes (i.e., coarser classes).

Paper Structure (29 sections, 2 equations, 11 figures, 5 tables)

Introduction
Related work
Background, materials and methods
New clustering benchmarks based on ImageNet21K
Quantifying the sensitivity to class imbalance
Quantifying the sensitivity to the class granularity
Easy-to-classify benchmarks: model-based ImageNet21K subsets
Multi-label clustering benchmarks and metrics
Experimental results
Impact of class imbalance
Impact of class granularity: coarse and fine-grained benchmarks
Results from easy-to-classify benchmarks: model-based ImageNet21K subsets
Multi-label clustering evaluations
Discussion, limitations and future work
Conclusion
...and 14 more sections

Figures (11)

Figure 1: Reassessed ImageNet21K samples using parent hierarchical zero-shot label refining (p-HZR). The original ground-truth (GT) label is shown on top, while each column shows the reassessed label obtained with openCLIP ViT-G [openclip] in conjunction with the semantic tree.
Figure 2: Left: Clustering accuracy in % (y-axis) versus maximum hierarchy depth $d$ (x-axis) by mapping each ImageNet21K class to its semantic ancestor at depth $d$. Right: Clustering accuracy in % (y-axis) versus interval $s$ (x-axis) centered around the median $m_s =[50-s,50+s]$ based on the ImageNet21K class histogram. Best viewed in color.
Figure 3: Left and middle: Calibration plots and expected calibration error (ECE $\downarrow$) for SCANv2 and TEMI on ImageNet-1K using MAE-R ViT-H [alkin2024mae_refine]. Right: Clustering ACC difference between TEMI and SCANv2 on ImageNet-1K across various pre-trained feature extractors. IN21K refers to ImageNet21K, and CLIP ViT-B and ViT-L use the weights from [radford2021clip].
Figure 4: We measure the dependence of the linear probing accuracy (x-axis) on the alignment score $\mathcal{L}_\mathsf{align}$ (left), the Davies-Bouldin score (DBS) [dbs1979cluster] (middle), and the Silhouette score (right) using iBOT ViT-L trained on ImageNet21K. $R^2$ is the coefficient of determination. Best viewed in color.
Figure 5: We show that increasing the number of nearest neighbors (NN) increases the ACC of TEMI, but $k$-means still outperforms TEMI on the coarse ImageNet21K benchmarks. Left: ImageNet21K with a maximum class depth of 1 (root labels on the semantic tree). Right: ImageNet coarse benchmark with a maximum class depth of 2.
...and 6 more figures
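
The granularity benchmarks in Figures 2 and 5 relabel each ImageNet21K class with its semantic ancestor at a maximum depth $d$. The sketch below illustrates one way such a relabeling could work. It is not the authors' implementation: the toy hierarchy, the `parent` dictionary, and the function name `ancestor_at_depth` are hypothetical, and the convention that root labels sit at depth 1 is an assumption taken from the Figure 5 caption.

```python
def ancestor_at_depth(label, parent, max_depth):
    """Map `label` to its ancestor at depth `max_depth` (roots have depth 1).

    `parent` maps each label to its parent, or to None for a root.
    Labels already at depth <= max_depth are returned unchanged.
    """
    if max_depth < 1:
        raise ValueError("max_depth must be >= 1")
    # Walk from the label up to its root, then reverse to get root-first order.
    path = [label]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    path.reverse()
    return path[min(max_depth, len(path)) - 1]


# Hypothetical toy hierarchy (child -> parent); not taken from ImageNet21K.
parent = {
    "animal": None,
    "dog": "animal",
    "husky": "dog",
    "poodle": "dog",
    "cat": "animal",
    "artifact": None,
    "vehicle": "artifact",
    "car": "vehicle",
}

labels = ["husky", "poodle", "cat", "car", "vehicle"]

# Relabel the same samples at three granularities.
coarse = {d: [ancestor_at_depth(l, parent, d) for l in labels] for d in (1, 2, 3)}
for d, mapped in coarse.items():
    print(d, mapped, "->", len(set(mapped)), "classes")
```

Smaller values of `max_depth` merge more leaf classes into shared ancestors, so the number of target classes shrinks as the benchmark becomes coarser.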
