Dirichlet Process-based Robust Clustering using the Median-of-Means Estimator

Supratik Basu, Jyotishka Ray Choudhury, Debolina Paul, Swagatam Das

TL;DR

This work presents DP-MoM, a clustering method that combines the robustness of the Median-of-Means (MoM) estimator with Dirichlet-process-based clustering, so that the number of clusters is inferred automatically while the fit resists noise and outliers. The method defines a MoM-aggregated objective over bucketed subsets of the data and optimizes it with AdaGrad, which yields robust centroid updates and adaptive cluster growth. Theoretical results establish finite-sample concentration and asymptotic consistency at an $\mathcal{O}(n^{-1/2})$ rate, and experiments on synthetic and real data show better performance than state-of-the-art clustering algorithms, particularly under contamination. DP-MoM thus offers a principled, scalable approach to robust clustering that does not require a predefined number of clusters.
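The idea of a MoM-aggregated objective optimized with AdaGrad can be illustrated with a minimal sketch. The function names (`mom_objective`, `adagrad_step`), the random bucketing, the choice to take the gradient on the median bucket, and the step size are all assumptions made for illustration; the summary above does not specify these details, so this should not be read as the authors' implementation.

```python
import numpy as np

def mom_objective(X, centroids, n_buckets, rng):
    """Median, over random buckets, of the mean squared distance to the nearest centroid.

    Returns the median bucket loss and the indices of the bucket attaining it.
    Assumes an odd n_buckets so the median is a single bucket.
    """
    idx = rng.permutation(len(X))
    buckets = np.array_split(idx, n_buckets)
    losses = []
    for b in buckets:
        # Squared distances from each point in the bucket to every centroid.
        d2 = ((X[b][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        losses.append(d2.min(axis=1).mean())
    losses = np.array(losses)
    mid = int(np.argsort(losses)[len(losses) // 2])
    return float(losses[mid]), buckets[mid]

def adagrad_step(X, centroids, grad_sq_sum, bucket, lr=0.5, eps=1e-8):
    """One AdaGrad update of the centroids using only the points in `bucket`."""
    Xb = X[bucket]
    d2 = ((Xb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    grad = np.zeros_like(centroids)
    for k in range(len(centroids)):
        members = Xb[labels == k]
        if len(members):
            # Gradient of the bucket's mean squared distance w.r.t. centroid k.
            grad[k] = 2.0 * (centroids[k] - members).sum(axis=0) / len(Xb)
    grad_sq_sum = grad_sq_sum + grad ** 2
    centroids = centroids - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return centroids, grad_sq_sum
```

Because the median is taken over buckets, a small number of outliers can corrupt only a minority of bucket losses, and the median bucket that drives the update is left untouched. In this sketch, with 90 clean points, 3 extreme outliers, and 9 buckets, at most 3 buckets can contain an outlier, so the median bucket is always clean.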

Abstract

Clustering stands as one of the most prominent challenges in unsupervised machine learning. Among centroid-based methods, the classic $k$-means algorithm, based on Lloyd's heuristic, is widely used. Nonetheless, it is well known that $k$-means and its variants face several challenges, including heavy reliance on the initial cluster centroids, a tendency to converge to local minima of the objective function, and sensitivity to outliers and noise in the data. When the data contain noise or outliers, the Median-of-Means (MoM) estimator offers a robust alternative for stabilizing centroid-based methods. A separate limitation of many commonly used clustering methods is the need to specify the number of clusters beforehand. Model-based approaches, such as Bayesian nonparametric models, address this issue by incorporating infinite mixture models, which eliminate the requirement for predefined cluster counts. Motivated by these facts, in this article we propose an efficient and automatic clustering technique that integrates the strengths of model-based and centroid-based methodologies. Our method mitigates the effect of noise on clustering quality while simultaneously estimating the number of clusters. Statistical guarantees on an upper bound of the clustering error, together with rigorous assessment on simulated and real datasets, suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
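To make concrete how a nonparametric model can remove the need for a predefined cluster count, the sketch below shows the well-known DP-means rule, a hard-assignment limit of a Dirichlet process mixture in which a point opens a new cluster whenever its squared distance to every existing centroid exceeds a penalty. This is a generic illustration, not the DP-MoM algorithm: the function name `dp_means`, the penalty parameter `lam`, and the fixed iteration count are assumptions, and the excerpt does not state whether the paper uses exactly this rule.

```python
import numpy as np

def dp_means(X, lam, n_iter=10):
    """DP-means-style clustering: the number of clusters grows with the data.

    A point whose squared distance to every centroid exceeds `lam`
    becomes the centroid of a new cluster.
    """
    centroids = X.mean(axis=0, keepdims=True)  # start from one global cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            d2 = ((centroids - x) ** 2).sum(axis=1)
            if d2.min() > lam:
                # Too far from all clusters: open a new one at this point.
                centroids = np.vstack([centroids, x])
                labels[i] = len(centroids) - 1
            else:
                labels[i] = int(d2.argmin())
        # Drop empty clusters, relabel, and recompute centroids as cluster means.
        keep = [k for k in range(len(centroids)) if np.any(labels == k)]
        centroids = np.array([X[labels == k].mean(axis=0) for k in keep])
        labels = np.array([keep.index(k) for k in labels])
    return centroids, labels
```

Here `lam` plays the role that $k$ plays in $k$-means: larger values yield fewer, coarser clusters. The centroid update in this sketch is a plain mean, which is exactly the step a robust estimator such as MoM would replace.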

Paper Structure

This paper contains 26 sections, 3 theorems, 42 equations, 4 figures, and 5 tables.

Key Result

Theorem 4.1

Under Aass-4-iid-Aass-5-L, with probability at least $1-2e^{-2L \delta^2}$,

Figures (4)

  • Figure 1: Several state-of-the-art clustering methods fail to cluster the data properly in the presence of noisy observations (shown in light green), while DP-MoM, our proposed algorithm, performs nearly optimally.
  • Figure 2: Dirichlet Process Clustering using Median-of-Means (DP-MoM)
  • Figure 3: Line plots of the ARI achieved by different algorithms on simulated datasets as the number of outliers increases. DP-MoM performs uniformly better than all competing methods.
  • Figure 4: Line plots of the ARI achieved by different algorithms as the number of noisy observations introduced into the Jain dataset increases. DP-MoM performs better than all competing methods.

Theorems & Definitions (7)

  • Remark 1
  • Theorem 4.1
  • Corollary 4.1
  • Theorem 4.2
  • proof
  • proof
  • proof