
Enabling clustering algorithms to detect clusters of varying densities through scale-invariant data preprocessing

Sunil Aryal, Jonathan R. Wells, Arbind Agrahari Baniya, KC Santosh

TL;DR

Clustering results are highly sensitive to how features are represented, especially when clusters have varying densities. The paper introduces ARES, a variant of rank transformation that averages ranks over $t$ sub-samples of size $\psi$, producing a transformed value $\tilde{x}_{ARES}$ that is robust to non-linear scaling. Across seven real-world datasets and three algorithms (KMeans, DBSCAN, and DP), ARES preprocessing yields better and more consistent clustering than min-max normalization or plain rank transformation, with DP often showing the strongest gains; in one synthetic case, DP with ARES achieved perfect clustering across representations. This scale-invariant preprocessing offers a practical, broadly applicable route to robust clustering without exhaustively testing representations.

Abstract

In this paper, we show that preprocessing data using a variant of rank transformation called 'Average Rank over an Ensemble of Sub-samples (ARES)' makes clustering algorithms robust to data representation and enables them to detect clusters of varying densities. Our empirical results, obtained using three of the most widely used clustering algorithms, namely KMeans, DBSCAN, and DP (Density Peak), across a wide range of real-world datasets, show that clustering after ARES transformation produces better and more consistent results.

Paper Structure

This paper contains 7 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of data distributions: The first row shows the distributions of an example dataset, along with its logarithmic and inverse scaling. The second and third rows depict the corresponding distributions after the Rank and proposed ARES transformations, respectively. In all cases, data are normalised to be in the range of [0, 1] before modelling the density distribution.
  • Figure 2: An illustration of ARES: (a) an example of a given dataset $D$; (b) an ensemble of ranking models $\mathcal{R}_j$ $(j=1,2,\cdots,t)$ built from sub-samples $D_j\subset D$ with $|D_j|=\psi=5$. Each $\mathcal{R}_j$ is constructed from $D_j$ by partitioning the real domain into $(\psi+1)$ bins, which are then ranked as $(0,1,\cdots,\psi)$ from left to right. The ARES transformation of a query point $x$ (shown in red) is computed by aggregating its rank in each $\mathcal{R}_j$.
  • Figure 3: Clustering results of the Jain dataset with $x$, $\log{x}$, and $x^{-1}$ scaling using DP, both with and without ARES transformation. Note that in the case of DP with $x^{-1}$ (Subfig c), only a portion of the plot is shown to provide a clearer view of most points. Because of a few extreme values, plotting the full range would collapse most points into what looks like a single point. One additional small cluster was detected in the part of the plot that is not shown.
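
The procedure described in the Figure 2 caption can be sketched for a single feature as follows. This is a minimal illustration, not the authors' implementation: the function name `ares_transform`, the sampling without replacement, the tie rule (rank = number of sub-sample points strictly below the query), and the division by $\psi$ to land in [0, 1] are all assumptions made here. For multi-feature data, one would presumably apply it to each feature separately.

```python
import numpy as np

def ares_transform(values, queries=None, t=100, psi=5, seed=0):
    """Average rank of each query over t random sub-samples of size psi.

    Hypothetical sketch: sampling scheme, tie rule, and scaling to [0, 1]
    are assumptions, not details confirmed by the source.
    """
    values = np.asarray(values, dtype=float)
    queries = values if queries is None else np.asarray(queries, dtype=float)
    if psi > values.size:
        raise ValueError("psi must not exceed the number of data points")
    rng = np.random.default_rng(seed)
    total = np.zeros(queries.shape, dtype=float)
    for _ in range(t):
        # Draw the indices of one sub-sample D_j (without replacement).
        idx = rng.choice(values.size, size=psi, replace=False)
        sub = np.sort(values[idx])
        # The psi sorted points split the real line into psi + 1 bins,
        # ranked 0..psi from left to right; searchsorted gives the bin index.
        total += np.searchsorted(sub, queries, side="left")
    # Average over the ensemble and scale by psi so the result lies in [0, 1].
    return total / (t * psi)

# The same data under two representations: raw and log-scaled.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0])
raw = ares_transform(x, t=50, psi=5, seed=1)
logged = ares_transform(np.log(x), t=50, psi=5, seed=1)
print(np.allclose(raw, logged))
```

With a fixed seed the same indices are drawn for both representations, and a strictly increasing rescaling such as $\log{x}$ preserves the order of points within every sub-sample, so the two outputs agree. This order-preservation is what the robustness to non-linear scaling rests on in this sketch.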