CoreSPECT: Enhancing Clustering Algorithms via an Interplay of Density and Geometry

Chandra Sekhar Mukherjee, Joonyoung Bae, Jiapeng Zhang

TL;DR

This work identifies a density-geometry correlation in real data and proposes CoreSPECT, a four-step framework that leverages density-layer cores and layer-wise CDNN-based propagation to boost clustering. By extracting dense cores, clustering them, constructing a core-directed nearest-neighbor graph, and expanding labels layer-by-layer, CoreSPECT substantially improves K-Means and HDBSCAN performance on 19 large datasets while remaining computationally efficient. Theoretical guarantees for a CoreSPECT-enabled K-Means variant are provided under the Layered Core-Periphery Density Model (LCPDM), and extensive experiments demonstrate meaningful gains in NMI and ARI across image and genomics domains, with notable speedups on large data. The method approaches or matches state-of-the-art manifold clustering in several cases without requiring priors such as the true number of clusters, highlighting its practical impact for scalable clustering in complex data geometries.

Abstract

In this paper, we provide a novel perspective on the underlying structure of real-world data with ground-truth clusters by characterizing an abundantly observed yet often overlooked density-geometry correlation that manifests itself as a multi-layered manifold structure. We leverage this correlation to design CoreSPECT (Core Space Projection based Enhancement of Clustering Techniques), a general framework that improves the performance of generic clustering algorithms. Our framework boosts the performance of clustering algorithms by applying them to strategically selected regions, then extending the resulting partial partition to a complete partition of the dataset using a novel neighborhood-graph-based multi-layer propagation procedure. We provide initial theoretical support for the functionality of our framework under the assumptions of our model, and then provide large-scale real-world experiments on 19 datasets that include standard image datasets as well as genomics datasets. We observe two notable improvements. First, CoreSPECT improves the NMI of K-Means by 20% on average, making it competitive with (and in some cases better than) the state-of-the-art manifold-based clustering algorithms, while being orders of magnitude faster. Second, our framework boosts the NMI of HDBSCAN by more than 100% on average, making it competitive with the state-of-the-art in several cases without requiring the true number of clusters or hyper-parameter tuning. The overall ARI improvements are higher.
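To make the core-then-propagate idea summarized above concrete, the following is a minimal NumPy sketch, not the authors' implementation. Several pieces are assumptions made only for illustration: the density score is the mean k-nearest-neighbor distance (a stand-in for FlowRank, which is not specified in this summary), the propagation is a majority vote over already-labelled nearest neighbors (a simplification of the CDNN-based expansion), and the names `knn`, `kmeans`, `core_then_propagate`, `core_frac`, and `n_layers` are hypothetical.

```python
import numpy as np

def knn(X, k):
    """Brute-force k nearest neighbours (indices, distances), excluding self."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    idx = np.argsort(d, axis=1)[:, :k]
    return idx, np.take_along_axis(d, idx, axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with farthest-point initialisation."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def core_then_propagate(X, k_clusters, k_nn=10, core_frac=0.2, n_layers=4):
    """Cluster a dense core, then extend labels outward one layer at a time."""
    n = len(X)
    nbrs, dists = knn(X, k_nn)
    # ASSUMPTION: mean k-NN distance as a density proxy (this is NOT FlowRank).
    order = np.argsort(dists.mean(axis=1))  # densest points first
    n_core = max(k_clusters, int(core_frac * n))
    labels = np.full(n, -1)
    # Steps 1-2: extract the dense core and cluster only those points.
    core = order[:n_core]
    labels[core] = kmeans(X[core], k_clusters)
    # Steps 3-4 (simplified): each layer is labelled from the inner layers only.
    for layer in np.array_split(order[n_core:], n_layers):
        snapshot = labels.copy()             # labels known before this layer
        known = np.flatnonzero(snapshot >= 0)
        for i in layer:
            votes = snapshot[nbrs[i]]
            votes = votes[votes >= 0]
            if votes.size:                   # majority vote of labelled neighbours
                labels[i] = np.bincount(votes).argmax()
            else:                            # fallback: nearest labelled point
                j = known[np.argmin(np.linalg.norm(X[known] - X[i], axis=1))]
                labels[i] = snapshot[j]
    return labels
```

Calling `core_then_propagate(X, k_clusters=2)` on an `(n, d)` array returns one integer label per row. Swapping `kmeans` for another base clusterer only changes the line that labels the core, which reflects the framework's claim of being agnostic to the base algorithm.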

Paper Structure

This paper contains 45 sections, 17 theorems, 14 equations, 11 figures, 2 tables, 4 algorithms.

Key Result

Theorem 1

Let the data be generated from the ${\sf LCPDM}(2,\ell)$ model, and let $\Pi$ be the density of the space. Then, for some $r=\mathcal{O}(1)$, in expectation, all the core points ($\hat{\mathcal{L}}_0$) get a score of $1$, and all non-core points get a score $<1$.

Figures (11)

  • Figure 1: Increasing dimensionality and degrading K-Means performance from inner to outer layers in image datasets. The layers are defined as deciles of points based on the FlowRank score presented in Algorithm \ref{alg: FlowRank}.
  • Figure 4: The CoreSPECT framework (see Figure \ref{fig:expo} for a schematic representation and Algorithm \ref{alg:CS-Kmeans} for its application to K-Means).
  • Figure 5: Comparing the accuracy of applying K-Means on the top $x$-fraction of the points (according to FlowRank) vs. applying K-Means to the top $10\%$ and then applying layer-wise expansion (Algorithm \ref{alg:prop}) up to the top $x$-fraction, for $x \in \{0.1, 0.2, \ldots, 1\}$, for some pairs of clusters. For $x=0.1$ (the cores), K-Means performs very well, but its performance deteriorates as it is applied to the outer layers. In contrast, layer-wise propagation decays significantly less, improving the overall performance.
  • Figure 6: Improvement of NMI on K-Means and HDBSCAN due to CoreSPECT, compared to the best density-peak-based clustering as well as spectral clustering. Impressively, CS-HDBSCAN performs on par with (and is sometimes better than) popular algorithms that need the true number of clusters.
  • Figure 7: We first sort the nodes by their FlowRank values and select the top and bottom 20%, which we call the inner-most and outer-most layers. We then find nearest neighbors using both Euclidean distance and shortest-path distance in the generated CDNN graph. The difference in nearest-neighbor label accuracy gives empirical evidence that the periphery nodes exhibit more non-linear structure, while the core nodes exhibit more Euclidean structure.
  • ...and 6 more figures
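
Several of the figures above report results in terms of NMI. As a reference point, here is a small NumPy sketch of normalized mutual information together with the relative-gain arithmetic behind statements such as "improves NMI by 20%". It assumes arithmetic-mean normalization of the two entropies; this summary does not state which normalization the paper uses, and the function names `entropy`, `nmi`, and `relative_gain` are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in nats) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def nmi(a, b):
    """Mutual information of two labelings divided by their mean entropy.

    ASSUMPTION: arithmetic-mean normalization; other conventions exist.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    # Contingency table: joint counts of (cluster in a, cluster in b).
    cont = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(cont, (ai, bi), 1)
    pij = cont / cont.sum()
    pi = pij.sum(axis=1, keepdims=True)
    pj = pij.sum(axis=0, keepdims=True)
    nz = pij > 0
    mi = float((pij[nz] * np.log(pij[nz] / (pi @ pj)[nz])).sum())
    denom = 0.5 * (entropy(a) + entropy(b))
    return mi / denom if denom > 0 else 1.0

def relative_gain(base, boosted):
    """Relative improvement of a score, e.g. 0.2 means a 20% gain."""
    return (boosted - base) / base
```

NMI is invariant to relabeling, so two partitions that differ only in cluster names score 1. Under this reading, a gain of more than 100% means the boosted score is more than double the baseline.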

Theorems & Definitions (32)

  • Definition 1: Layer-preserving ranking
  • Theorem 1: Core-detection by FlowRank
  • Proposition 1
  • Definition 2: Centrally directed nearest neighbor graph (CDNN) $G^{+}_{(t,S)}$
  • Theorem 2
  • Theorem 3: Clustering in the $\sf LCPDM$ model
  • Definition 3: federer1959curvature
  • Proposition: Restated Proposition \ref{prop:k-means-separate}
  • Proof
  • Lemma 1
  • ...and 22 more