Hierarchical clustering that takes advantage of both density-peak and density-connectivity

Ye Zhu, Kai Ming Ting, Yuan Jin, Maia Angelova

TL;DR

This work formalizes two cluster notions, $\eta$-linked and $\eta$-density-connected clusters, to analyze and extend Density Peak (DP) clustering. It shows DP targets $\eta$-linked clusters but has two fundamental weaknesses, which are not resolved by Local Contrast; to address this, the authors introduce DC-HDP, a density-connected hierarchical clustering that merges cluster modes only when they are connected by an $\eta$-density-connected path, preserving DP's efficiency while enabling arbitrary shapes and highly varied densities. DC-HDP yields a dendrogram, providing richer hierarchical cluster information and a principled way to extract flat clusters at desired granularity. Empirically, DC-HDP outperforms a broad set of state-of-the-art clustering algorithms (density-based, hierarchical, and graph-based) across 28 datasets, with a macro F-measure average of 0.82 and competitive runtimes. The approach offers a rigorous foundation for hierarchical density-based clustering and practical gains in cluster discovery and interpretation.

Abstract

This paper focuses on density-based clustering, particularly the Density Peak (DP) algorithm and DBSCAN, which is based on density-connectivity, and proposes a new method that combines the individual strengths of the two to yield a density-based hierarchical clustering algorithm. Our investigation begins by formally defining the types of clusters DP and DBSCAN are designed to detect, and then identifies the kinds of distributions for which DP and DBSCAN individually fail to detect all clusters in a dataset. These weaknesses motivate us to formally define a new kind of cluster and to propose a new method, called DC-HDP, that overcomes them and identifies clusters with arbitrary shapes and varied densities. In addition, the new method produces a richer clustering result in the form of a hierarchy (dendrogram), which gives a better understanding of the cluster structure. Our empirical evaluation shows that DC-HDP produces the best clustering results on 14 datasets in comparison with 7 state-of-the-art clustering algorithms.

Paper Structure

This paper contains 17 sections, 4 theorems, 8 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Every point in $D_{\ominus}$ has a path to $\hat{m}$ when the dataset only has one mode, i.e., $\forall_{x\in D_{\ominus}} \ 1 < Lpath(x, \hat{m}) \leqslant n$.
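
To make the statement concrete, the following is a minimal sketch on hypothetical one-dimensional data. It assumes the usual DP convention that each point links to its nearest neighbour of higher density, reads $Lpath$ as the number of points on the chain of such links (an assumption, since the definition is not reproduced here), and breaks density ties by index so that exactly one point has no link. The function names, the `eps` radius and the data are illustrative only and are not taken from the paper.

```python
def density(points, eps):
    """Count the points within distance eps of each point (itself included)."""
    return [sum(1 for q in points if abs(p - q) <= eps) for p in points]


def higher_density_links(points, rho):
    """Index of the nearest neighbour ranked higher by density, or None.

    Ties in density are broken by index (lower index ranks higher), an
    assumption made here so that the ranking is a strict total order.
    """
    n = len(points)
    links = []
    for i in range(n):
        higher = [j for j in range(n) if (rho[j], -j) > (rho[i], -i)]
        if higher:
            links.append(min(higher, key=lambda j: abs(points[j] - points[i])))
        else:
            links.append(None)  # no higher-ranked point: this is the mode
    return links


def path_to_mode(i, links):
    """Follow the links from point i until a point without a link is reached."""
    path = [i]
    while links[path[-1]] is not None:
        path.append(links[path[-1]])
    return path


# Hypothetical data: one dense region in the middle, sparser tails.
points = [0.0, 0.9, 1.6, 2.0, 2.3, 2.5, 2.8, 3.2, 3.9, 5.0]
rho = density(points, eps=0.6)
links = higher_density_links(points, rho)
modes = [i for i, link in enumerate(links) if link is None]

for i in range(len(points)):
    if i not in modes:
        path = path_to_mode(i, links)
        print(points[i], "->", [points[j] for j in path[1:]])
```

Because every link moves to a strictly higher-ranked point, no point can repeat on a path, so a path visits at most $n$ points and always ends at the single unlinked point. This mirrors the bound $1 < Lpath(x, \hat{m}) \leqslant n$ under the reading above.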

Figures (7)

  • Figure 1: Clustering results of different algorithms on a dataset having clusters with varied densities. "-1" marks the points labelled as noise by the algorithm, shown in light yellow. Note that in Figure (b), only the detected sparse cluster with the most assigned points is labelled as cluster 3.
  • Figure 2: Density distribution of four one-dimensional datasets. Each dataset has two clusters with different densities. (a) contains only $\eta$-linked clusters, but (b), (c) and (d) contain non-$\eta$-linked clusters. In each cluster of (a), the density of the instances is strictly decreasing from the cluster mode on both the left and right sides. In (c) and (d), the green and orange dashed lines point to the nearest neighbour with higher density of Peak 3 from the left side and right side, respectively. The red dashed line indicates the boundary of the two clusters identified by DP.
  • Figure 5: Density distribution of four one-dimensional datasets illustrating $\eta$-density-connected clusters. Parameter $\tau$ is the density threshold used for the density-connectivity check. The example in (d) shows that none of the points which are $\eta$-density-connected to Peak 1 is density-connected with any point which is $\eta$-density-connected to Peak 3. The path between Peak 1 and Peak 3 is disconnected by the $\tau$ setting shown in (d). In (c) and (d), the orange dashed lines point to the nearest density-connected neighbour with higher density of Peak 3. The red dashed line indicates the boundary of the two clusters.
  • Figure 6: The hierarchical DP clustering results on three datasets: they are the same as those produced by the flat DP shown in Figure \ref{fig01} and Figure \ref{fig1}. The horizontal line denotes the $\gamma$ threshold and all points below that line are grouped as a cluster. The colours on the dendrogram branch correspond to the clustering results. The colours at the bottom row in each dendrogram correspond to the true cluster labels of all points shown in Figures (b), (d) and (f).
  • Figure 7: DC-HDP's hierarchical clustering results on the three datasets. Setting $\epsilon=0.1$ and $\tau=1$ (with $k$ set to the true number of clusters) produces perfect results on all three datasets. The colours used in each dendrogram (on the left) are the same as those used in the corresponding scatter plot (on the right).
  • ...and 2 more figures
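
The role of $\tau$ described in the Figure 5 caption can be illustrated with a small sketch on hypothetical one-dimensional data. It is not the DC-HDP algorithm: it assumes a DBSCAN-style reading in which a point is a core point when its $\epsilon$-neighbourhood count is at least $\tau$, two core points are density-connected when a chain of core points with steps of at most $\epsilon$ joins them, and a point may link only to a higher-density neighbour in its own connected group. Border-point handling is omitted, and all names and data are illustrative.

```python
def eps_density(points, eps):
    """Count the points within distance eps of each point (itself included)."""
    return [sum(1 for q in points if abs(p - q) <= eps) for p in points]


def core_components(points, rho, eps, tau):
    """Give each core point (density >= tau) a component id; others get None."""
    n = len(points)
    label = [None] * n
    next_id = 0
    for start in range(n):
        if rho[start] < tau or label[start] is not None:
            continue
        label[start] = next_id
        stack = [start]
        while stack:  # depth-first search over eps-close core points
            i = stack.pop()
            for j in range(n):
                close = abs(points[i] - points[j]) <= eps
                if label[j] is None and rho[j] >= tau and close:
                    label[j] = next_id
                    stack.append(j)
        next_id += 1
    return label


def find_modes(points, eps, tau):
    """Core points with no higher-density neighbour in their own component."""
    rho = eps_density(points, eps)
    label = core_components(points, rho, eps, tau)
    n = len(points)
    modes = []
    for i in range(n):
        if label[i] is None:
            continue  # below the density threshold: not linked at all
        # Density ties are broken by index (an assumption of this sketch).
        higher = [j for j in range(n)
                  if label[j] == label[i] and (rho[j], -j) > (rho[i], -i)]
        if not higher:
            modes.append(i)
    return modes, label


# Two dense groups joined by a sparse bridge at 0.9 and 1.4 (hypothetical).
points = [0.0, 0.1, 0.2, 0.3, 0.4, 0.9, 1.4, 1.9, 2.0, 2.1, 2.2, 2.3]

modes_low, labels_low = find_modes(points, eps=0.55, tau=1)
modes_high, labels_high = find_modes(points, eps=0.55, tau=4)
print("tau=1:", len(modes_low), "mode(s)")
print("tau=4:", len(modes_high), "mode(s)")
```

With the low threshold the sparse bridge counts as core, so both groups fall into one component with a single mode. With the higher threshold the bridge points drop out, the chain between the two groups is cut, and each group keeps its own mode, which is the kind of disconnection the caption attributes to the $\tau$ setting.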

Theorems & Definitions (16)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof
  • Definition 4
  • ...and 6 more