Cluster Catch Digraphs with the Nearest Neighbor Distance

Rui Shi, Elvan Ceyhan, Nedret Billor

TL;DR

This paper introduces UN-CCDs, a parameter-free CCD-based clustering method that uses the nearest neighbor distance (NND) within a Monte Carlo Spatial Randomness Test (MC-SRT) to determine covering-ball radii. By replacing Ripley’s K-function with NND in the MC-SRT and adding enhancements such as Holm-corrected tests, descending radius exploration, and an intersection-graph refinement, UN-CCDs improve clustering quality in high-dimensional data. Extensive Monte Carlo simulations and real-data experiments show that UN-CCDs are competitive with KS-CCDs and RK-CCDs, offering especially strong performance in high dimensions while remaining robust to noise. The work highlights a practical, scalable approach for high-dimensional clustering, with clear avenues for future extensions (overlapping clusters, semi-supervised settings, and automated tuning).

Abstract

We introduce a new clustering method based on Cluster Catch Digraphs (CCDs). The method addresses the limitations of RK-CCDs by using a new variant of the spatial randomness test that relies on the nearest neighbor distance (NND) instead of the Ripley's K function used by RK-CCDs. We conduct a comprehensive Monte Carlo analysis to assess the performance of our method, considering factors such as dimensionality, data set size, number of clusters, cluster volumes, and inter-cluster distance. Our method is particularly effective for high-dimensional data sets, performing comparably to or better than KS-CCDs and RK-CCDs, which rely on a KS-type statistic and Ripley's K function, respectively. We also evaluate our method on real and complex data sets, comparing it to well-known clustering methods. Again, it exhibits competitive performance, producing high-quality clusters with desirable properties.

Keywords: Graph-based clustering, Cluster catch digraphs, High-dimensional data, The nearest neighbor distance, Spatial randomness test
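To make the idea of a spatial randomness test built on the nearest neighbor distance concrete, the following is a minimal sketch, not the paper's exact procedure. It assumes the test statistic is the mean NND of the points inside a ball, the null model is a uniform distribution over that ball, and the p-value is a one-sided Monte Carlo rank (small NND suggests clustering). The function names `mean_nnd`, `sample_uniform_ball`, and `mc_nnd_pvalue` are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_nnd(points):
    """Mean nearest neighbor distance of an (n, d) array with n >= 2."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    return dists.min(axis=1).mean()


def sample_uniform_ball(n, d, radius, rng):
    """Draw n points uniformly from the d-dimensional ball centered at 0."""
    directions = rng.standard_normal((n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii follow radius * U^(1/d) so that the density is uniform in volume.
    radii = radius * rng.random(n) ** (1.0 / d)
    return directions * radii[:, None]


def mc_nnd_pvalue(points, center, radius, n_sim=99, rng=None):
    """One-sided Monte Carlo p-value for clustering inside a ball.

    Assumption (not from the paper): the statistic is the mean NND, and the
    observed value is ranked against n_sim uniform samples of the same size.
    """
    rng = np.random.default_rng() if rng is None else rng
    inside = points[np.linalg.norm(points - center, axis=1) <= radius]
    n, d = inside.shape
    if n < 2:
        raise ValueError("need at least two points inside the ball")
    observed = mean_nnd(inside)
    simulated = np.array(
        [mean_nnd(sample_uniform_ball(n, d, radius, rng)) for _ in range(n_sim)]
    )
    # Small mean NND relative to the uniform null indicates clustering.
    return (1 + np.sum(simulated <= observed)) / (n_sim + 1)
```

Under these assumptions, a small p-value means the points in the ball are packed more tightly than a uniform sample would be, so a ball of that radius would be rejected as spatially random. The mean NND needs only pairwise distances, with no region-volume or edge-correction terms in the statistic itself.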

Paper Structure

This paper contains 25 sections, 4 equations, 11 figures, 12 tables, 3 algorithms.

Figures (11)

  • Figure 1: An illustration of clustering with UN-CCDs. Top-left: A dataset consisting of 5 clusters generated from 5 different bivariate normal distributions. Top-right: The covering balls of an approximate minimum dominating set (MDS) obtained by Greedy Algorithm \ref{alg:greedy-outdegree-orig_digraph}. Bottom-left: The covering balls of an approximate MDS of the intersection graph. Bottom-right: The dominating covering balls of the intersection graph that maximize the silhouette index $Sil(P)$.
  • Figure 2: Realizations of the simulation settings with 2, 3, and 5 uniform clusters in $\mathbb{R}^2$.
  • Figure 3: Line plots of the ARIs of KS-CCDs under the uniform cluster settings.
  • Figure 4: Line plots of the ARIs of RK-CCDs under the uniform cluster settings.
  • Figure 5: Line plots of the ARIs of UN-CCDs under the uniform cluster settings.
  • ...and 6 more figures
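
The caption of Figure 1 refers to an approximate MDS of covering balls found by a greedy algorithm. As a rough sketch of that kind of step, the code below runs the standard greedy heuristic for a dominating set on a ball-covering digraph: it repeatedly picks the ball that covers the most still-uncovered points. This assumes each point already has a covering-ball radius, and it is a generic heuristic rather than the paper's algorithm, whose selection score may differ. The name `greedy_dominating_set` is hypothetical.

```python
import numpy as np


def greedy_dominating_set(centers, radii):
    """Greedy approximate dominating set of a ball-covering digraph.

    centers: (n, d) array of points; radii: (n,) array of ball radii.
    There is an arc i -> j whenever point j lies in the closed ball of point i.
    Returns the indices of the chosen balls in selection order.
    """
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    covers = dists <= radii[:, None]  # covers[i, j]: ball i contains point j
    uncovered = np.ones(len(centers), dtype=bool)
    chosen = []
    while uncovered.any():
        # Number of still-uncovered points each ball would newly cover.
        gains = (covers & uncovered).sum(axis=1)
        best = int(np.argmax(gains))
        chosen.append(best)
        uncovered &= ~covers[best]
    return chosen


# Usage: three nearby points and one isolated point on a line.
centers = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
radii = [0.5, 1.0, 0.5, 0.5]
selected = greedy_dominating_set(centers, radii)
```

Because every point lies in its own ball, each iteration covers at least one new point and the loop terminates. In the usage example, the ball around the second point covers the first three points, and the isolated point must cover itself.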