Enabling clustering algorithms to detect clusters of varying densities through scale-invariant data preprocessing
Sunil Aryal, Jonathan R. Wells, Arbind Agrahari Baniya, KC Santosh
TL;DR
Clustering results are highly sensitive to feature representation, especially for clusters with varying densities. The paper introduces ARES, a variant of rank transformation that averages ranks over $t$ sub-samples of size $\psi$, producing a representation $\tilde{x}_{ARES}$ that is robust to non-linear scaling. Across seven real-world datasets and three algorithms (KMeans, DBSCAN, and DP), ARES preprocessing yields better and more consistent clustering than min-max normalization or plain rank transformation, with DP often showing the strongest gains; in one synthetic case, DP with ARES achieved perfect clustering across representations. This scale-invariant preprocessing offers a practical, broadly applicable approach to robust clustering without exhaustive representation testing.
Abstract
In this paper, we show that preprocessing data using a variant of rank transformation called 'Average Rank over an Ensemble of Sub-samples (ARES)' makes clustering algorithms robust to data representation and enables them to detect clusters of varying densities. Our empirical results, obtained using three of the most widely used clustering algorithms, namely KMeans, DBSCAN, and DP (Density Peak), across a wide range of real-world datasets, show that clustering after ARES transformation produces better and more consistent results.
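To make the idea of averaging ranks over an ensemble of sub-samples concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes that the rank of a value within a sub-sample is the number of sub-sample points strictly smaller than it, that sub-samples are drawn without replacement, and that each feature is transformed independently. The function name `ares_transform`, its parameters `t`, `psi`, and `seed`, and their default values are illustrative choices, not taken from the paper.

```python
import numpy as np

def ares_transform(X, t=10, psi=8, seed=0):
    """Sketch of an ARES-style transform (assumed details, see lead-in).

    For each feature, every value is replaced by its rank within a random
    sub-sample of size psi, averaged over t sub-samples.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    psi = min(psi, n)                      # a sub-sample cannot exceed the data size
    rng = np.random.default_rng(seed)
    out = np.zeros((n, d))
    for _ in range(t):
        # Assumption: one set of row indices per sub-sample, drawn without replacement.
        idx = rng.choice(n, size=psi, replace=False)
        for j in range(d):
            sub = np.sort(X[idx, j])
            # side="left" counts sub-sample values strictly smaller than each x.
            out[:, j] += np.searchsorted(sub, X[:, j], side="left")
    return out / t                         # average rank, in the range [0, psi]
```

Because only the ordering of values enters the computation, applying a strictly increasing transformation such as a logarithm to a feature leaves the output unchanged when the same sub-samples are drawn (that is, when the same `seed` is used). The transformed array can then be passed to a clustering algorithm in place of the raw or min-max normalized data.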
