CDF Transform-and-Shift: An effective way to deal with datasets of inhomogeneous cluster densities

Ye Zhu, Kai Ming Ting, Mark Carman, Maia Angelova

TL;DR

The paper tackles the problem of biased clustering and anomaly detection in datasets with inhomogeneous cluster densities. It introduces CDF Transform-and-Shift (CDF-TS), a multi-dimensional CDF-based preprocessing that homogenises local densities while preserving cluster structure, enabling existing algorithms to operate under their implicit assumptions without modification. Through extensive experiments on clustering and $k$NN anomaly detection, CDF-TS consistently improves performance over state-of-the-art remedies like ReScale and DScale, and is compatible with multiple density estimators. The method provides a practical, general approach to mitigating density bias and can extend to other density-based techniques beyond those tested.

Abstract

The problem of inhomogeneous cluster densities has been a long-standing issue for distance-based and density-based algorithms in clustering and anomaly detection. These algorithms implicitly assume that all clusters have approximately the same density. As a result, they often exhibit a bias towards dense clusters in the presence of sparse clusters. Many remedies have been suggested; yet, we show that they are partial solutions which do not address the issue satisfactorily. To match the implicit assumption, we propose to transform a given dataset such that the transformed clusters have approximately the same density while all regions of locally low density become globally low density -- homogenising cluster density while preserving the cluster structure of the dataset. We show that this can be achieved by using a new multi-dimensional Cumulative Distribution Function in a transform-and-shift method. The method can be applied to every dataset, before the dataset is used in many existing algorithms to match their implicit assumption without algorithmic modification. We show that the proposed method performs better than existing remedies.
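The density bias described above can be illustrated with a small numerical sketch. It does not implement CDF-TS or any of the remedies the paper compares against; it only shows, for a hypothetical mixture of three one-dimensional Gaussians (the means, standard deviations, weights, grid, and threshold range are all assumptions chosen for illustration), that a single global density threshold cannot recover all three clusters when one of them is much sparser than the others.

```python
import numpy as np

def mixture_pdf(x, means, stds, weights):
    """Density of a 1-D Gaussian mixture evaluated at each point in x."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (n, 1)
    comp = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comp @ weights                            # shape (n,)

def count_clusters(density, threshold):
    """Count maximal runs of grid points whose density exceeds the threshold."""
    above = density > threshold
    run_starts = above[1:] & ~above[:-1]
    return int(above[0]) + int(run_starts.sum())

# Hypothetical parameters (not from the paper): two dense clusters, one sparse.
means = np.array([0.0, 2.0, 12.0])
stds = np.array([0.5, 0.5, 4.0])
weights = np.array([1 / 3, 1 / 3, 1 / 3])

grid = np.linspace(-3.0, 25.0, 2801)
density = mixture_pdf(grid, means, stds, weights)

# Density in the valley between the dense clusters and at the sparse cluster's centre.
valley, sparse_peak = mixture_pdf([1.0, 12.0], means, stds, weights)

# Sweep a range of global thresholds and record how many clusters each one yields.
thresholds = np.linspace(0.001, 0.3, 300)
counts = [count_clusters(density, t) for t in thresholds]

print(f"valley between dense clusters: {valley:.4f}")
print(f"peak of sparse cluster:        {sparse_peak:.4f}")
print(f"most clusters found by any single threshold: {max(counts)}")
```

With these assumed parameters, the valley between the two dense clusters is denser than the peak of the sparse cluster. A threshold low enough to keep the sparse cluster therefore merges the two dense ones, and a threshold high enough to split the dense ones discards the sparse one, so no single threshold yields more than two clusters.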

Paper Structure

This paper contains 24 sections, 1 theorem, 30 equations, 12 figures, 14 tables, 2 algorithms.

Key Result

Lemma 1

For any data distribution and sufficiently small values of $\gamma$ and $\lambda$ s.t. $\gamma < \lambda$, if $x$ is at a local maximum density of $\mathcal{N}(x;\lambda)$, then ${rpdf}(x;\gamma,\lambda)\geqslant 1$; and if $x$ is at a local minimum density of $\mathcal{N}(x;\lambda)$, then ${rpdf}(

Figures (12)

  • Figure 1: An image used for segmentation.
  • Figure 2: (a) A mixture of three one-dimensional Gaussian distributions that cannot be separated using a single density threshold; (b) Density distribution on the ReScaled data of (a), where a single density threshold can be found to separate all three clusters. Note that points $x_1$, $x_2$ and $x_3$ are shifted to $y_1$, $y_2$ and $y_3$, respectively. $\eta$ is a larger bandwidth than $\epsilon$.
  • Figure 3: (a) A mixture of two one-dimensional Gaussian distributions where C is a normal cluster and A is an anomalous cluster; (b) Density distribution on the ReScaled data of (a), where the anomalous cluster is farther from the normal cluster centre. Note that points $x_1$, $x_2$ and $x_3$ are shifted to $y_1$, $y_2$ and $y_3$, respectively.
  • Figure 4: Scatter plots illustrating the effects of CDF-TS on a two-dimensional dataset with $\lambda=0.1$. $Std(x)=\sigma( \widehat{pdf}_{\epsilon}(x\in D))$ represents the standard deviation of the density with $\epsilon=0.1$.
  • Figure 5: (a) The scatter plot of a two-dimensional dataset containing three elongated clusters. (b) The scatter plot of a two-dimensional dataset containing four clusters.
  • ...and 7 more figures

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 1
  • proof