Enabling clustering algorithms to detect clusters of varying densities through scale-invariant data preprocessing

Sunil Aryal; Jonathan R. Wells; Arbind Agrahari Baniya; KC Santosh

TL;DR

Clustering results are highly sensitive to feature representation, especially for clusters with varying densities. The paper introduces ARES, a variant of rank transformation that averages ranks over $t$ sub-samples of size $\psi$, producing $\tilde{x}_{ARES}$ that is robust to non-linear scaling. Across seven real-world datasets and three algorithms (KMeans, DBSCAN, and DP), ARES preprocessing yields better and more consistent clustering than min-max normalization or plain rank, with DP often showing the strongest gains; in one synthetic case, DP with ARES achieved perfect clustering across representations. This scale-invariant preprocessing offers a practical, broadly applicable approach to robust clustering without exhaustive representation testing.

Abstract

In this paper, we show that preprocessing data using a variant of rank transformation called 'Average Rank over an Ensemble of Sub-samples (ARES)' makes clustering algorithms robust to data representation and enables them to detect varying density clusters. Our empirical results, obtained using three of the most widely used clustering algorithms, namely KMeans, DBSCAN, and DP (Density Peak), across a wide range of real-world datasets, show that clustering after ARES transformation produces better and more consistent results.

Paper Structure (7 sections, 3 equations, 3 figures, 3 tables)

This paper contains 7 sections, 3 equations, 3 figures, 3 tables.

Introduction
ARES Transformation
Empirical Evaluation
Experimental setup
Results in the given representation of data
Results with different representations of data
Conclusions and Future Work

Figures (3)

Figure 1: Comparison of data distributions: The first row shows the distributions of an example dataset, along with its logarithmic and inverse scaling. The second and third rows depict the corresponding distributions after the Rank and proposed ARES transformations, respectively. In all cases, data are normalised to be in the range of [0, 1] before modelling the density distribution.
Figure 2: An illustration of ARES: (a) an example of a given dataset $D$; (b) an ensemble of ranking models $\mathcal{R}_j$ $(j=1,2,\cdots,t)$ built from sub-samples $D_j\subset D$ with $|D_j|=\psi=5$. Each $\mathcal{R}_j$ is constructed from $D_j$ by partitioning the real domain into $(\psi+1)$ bins, which are then ranked as $(0,1,\cdots,\psi)$ from left to right. The ARES transformation of a query point $x$ (shown in red) is computed by aggregating its rank in each $\mathcal{R}_j$.
Figure 3: Clustering results of the Jain dataset with $x$, $\log{x}$, and $x^{-1}$ scaling using DP, both with and without ARES transformation. Note that in the case of DP with $x^{-1}$ (Subfig c), only a portion of the plot is shown to provide a clearer view of most points. Due to some extreme values, displaying all points in this section would result in grouping most of them into a single point. Additionally, another small cluster was detected in the unseen part of the plot.
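
The construction described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ares_transform`, its parameters, the per-feature application, sampling without replacement, and the tie handling (a point equal to a cut point falls in the bin to its left) are all choices made here for illustration. Each sub-sample of size $\psi$ supplies $\psi$ cut points, so a query value receives a rank in $(0,1,\cdots,\psi)$, and the ranks are averaged over the $t$ sub-samples.

```python
import numpy as np

def ares_transform(X, t=10, psi=5, seed=0):
    """Hypothetical helper: per-feature average rank over t sub-samples of size psi."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D input as a single feature
    n, d = X.shape
    rng = np.random.default_rng(seed)
    out = np.zeros((n, d))
    for _ in range(t):
        # Assumption: each sub-sample D_j is drawn without replacement from D.
        idx = rng.choice(n, size=psi, replace=False)
        for k in range(d):
            cuts = np.sort(X[idx, k])  # psi cut points define psi + 1 bins
            # Rank of each value = number of cut points strictly below it (0..psi).
            out[:, k] += np.searchsorted(cuts, X[:, k], side="left")
    return out / t  # average rank over the ensemble
```

Because only the order of values matters, applying the sketch to a strictly increasing rescaling of the data (for example $\log{x}$ on positive values) with the same seed returns the same output. Dividing the result by `psi` would map it into the range [0, 1].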
