Convex Clustering Redefined: Robust Learning with the Median of Means Estimator
Sourav De, Koustav Chowdhury, Bibhabasu Mandal, Sagar Ghosh, Swagatam Das, Debolina Paul, Saptarshi Chakraborty
TL;DR
This work addresses robust clustering when the number of clusters is not known in advance, by integrating convex clustering with the Median-of-Means (MoM) estimator. The proposed method, COMET, introduces random binning and pairwise-distance clipping, is optimized with Adam, and obtains cluster assignments from a centroid-graph post-processing step. The authors establish finite-sample deviation bounds and weak consistency, and report through synthetic and real-data experiments, including challenging brain-microarray datasets, that COMET outperforms state-of-the-art baselines in robustness and efficiency. Overall, COMET offers a scalable, outlier-resistant alternative to traditional clustering that adapts to data contamination while recovering cluster structure with theoretical guarantees.
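To illustrate what a centroid-graph post-processing step can look like, the sketch below reads clusters off a set of fitted centroids by linking any two points whose centroids lie within a tolerance and taking connected components. This is the generic way convex-clustering solutions are usually converted to labels, not necessarily the exact rule used by COMET; the function name `clusters_from_centroids` and the threshold `tol` are hypothetical.

```python
import math


def clusters_from_centroids(centroids, tol=1e-3):
    """Label points by connected components of a centroid graph.

    Assumption: two points share a cluster when their fitted centroids are
    within `tol` of each other (directly or through a chain of such links).
    """
    n = len(centroids)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge between every pair of nearly fused centroids.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centroids[i], centroids[j]) <= tol:
                parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


fitted = [(0.0, 0.0), (0.0, 0.0005), (5.0, 5.0), (5.0004, 5.0), (9.0, 9.0)]
print(clusters_from_centroids(fitted))  # [0, 0, 1, 1, 2]
```

The number of clusters is not an input here: it is simply the number of connected components, which is how a convex-clustering pipeline can avoid fixing k beforehand.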
Abstract
Clustering approaches that use convex loss functions have recently attracted growing interest as a way to form compact data clusters. Although classical methods such as k-means and its many variants are still widely used, all of them require the number of clusters k to be supplied as input, and many are notably sensitive to initialization. Convex clustering provides a more stable alternative by formulating the clustering task as a convex optimization problem, which ensures a unique global solution. However, it struggles with high-dimensional data, especially in the presence of noise and outliers. In addition, strong fusion regularization, controlled by the regularization tuning parameter, can hinder effective cluster formation within a convex clustering framework. To overcome these challenges, we introduce a robust approach that integrates convex clustering with the Median of Means (MoM) estimator, yielding an outlier-resistant and efficient clustering framework that does not require prior knowledge of the number of clusters. By combining the robustness of MoM with the stability of convex clustering, our method improves both performance and efficiency, especially on large-scale datasets. Theoretical analysis establishes weak consistency under specific conditions, while experiments on synthetic and real-world datasets show that the method outperforms existing approaches.
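As background for the robustness claim, the following minimal sketch shows the generic Median-of-Means idea for estimating a mean: split the sample at random into bins, average within each bin, and report the median of the bin averages. It illustrates the estimator only, not the COMET objective; the function name `median_of_means` and its parameters `n_bins` and `seed` are hypothetical.

```python
import random
import statistics


def median_of_means(values, n_bins, seed=None):
    """Generic Median-of-Means estimate of the mean of a 1-D sample."""
    values = list(values)
    if not 1 <= n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and the sample size")

    # Random binning: shuffle, then deal values round-robin into n_bins
    # blocks of near-equal size.
    random.Random(seed).shuffle(values)
    bins = [values[i::n_bins] for i in range(n_bins)]

    # The median of the bin means ignores the minority of corrupted bins.
    return statistics.median(statistics.fmean(b) for b in bins)


# 95 clean observations and 5 gross outliers.
sample = [1.0] * 95 + [1000.0] * 5
print(statistics.fmean(sample))                    # 50.95
print(median_of_means(sample, n_bins=11, seed=0))  # 1.0
```

With 11 bins, the 5 outliers can contaminate at most 5 bins, so at least 6 bin means are clean and the median is unaffected. In general the estimate stays reliable as long as fewer than half of the bins contain an outlier.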
