The Impact of Isolation Kernel on Agglomerative Hierarchical Clustering Algorithms

Xin Han, Ye Zhu, Kai Ming Ting, Gang Li

TL;DR

The paper tackles the challenge of varied-density clusters hindering distance-based agglomerative hierarchical clustering (AHC). It proposes a generic kernel-based approach that replaces the distance with a data-dependent kernel, specifically Isolation Kernel (IK), to produce purer dendrograms. It formalizes a condition for successful cluster extraction and introduces entanglement to describe cross-cluster merges, showing IK reduces entanglements and density bias. Empirically, IK improves dendrogram purity across four algorithms (T-AHC, HDBSCAN, GDL, PHA) and outperforms Gaussian and Adaptive Gaussian kernels, underscoring IK's broad applicability for hierarchical clustering with varied-density data.
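Dendrogram purity, the quality measure named above, is not defined in this summary. The sketch below assumes the commonly used definition: for every pair of points sharing a ground-truth label, take the smallest subtree containing both and record the fraction of its leaves with that label, then average over all such pairs. The function name `dendrogram_purity` and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def dendrogram_purity(Z, labels):
    """Dendrogram purity of a SciPy linkage matrix Z (assumed definition).

    A same-class pair has a given merge node as its least common ancestor
    exactly when one point lies in the left child and the other in the
    right child, so pairs can be counted per node without enumerating them.
    """
    classes, y = np.unique(np.asarray(labels), return_inverse=True)
    n = len(y)
    # counts[v, c] = number of leaves of class c under node v
    counts = np.zeros((2 * n - 1, len(classes)))
    counts[np.arange(n), y] = 1.0
    weighted = 0.0
    for i, (a, b) in enumerate(Z[:, :2].astype(int)):
        left, right = counts[a], counts[b]
        merged = left + right
        counts[n + i] = merged
        pairs = left * right                 # same-class pairs split here
        weighted += np.sum(pairs * merged / merged.sum())
    root = counts[2 * n - 2]
    total_pairs = np.sum(root * (root - 1) / 2.0)
    return weighted / total_pairs            # assumes at least one same-class pair


# Toy data (hypothetical): two well-separated 1-D groups.
X_demo = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
Z_demo = linkage(X_demo, method="single")
pure = dendrogram_purity(Z_demo, [0, 0, 0, 1, 1, 1])   # labels match the groups
mixed = dendrogram_purity(Z_demo, [0, 1, 0, 1, 0, 1])  # labels cut across groups
print(pure, mixed)
```

When the labels follow the two groups, every same-class pair meets in a pure subtree; when the labels cut across the groups, some pairs only meet at mixed subtrees and the score drops.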

Abstract

Agglomerative hierarchical clustering (AHC) is one of the popular clustering approaches. Existing AHC methods, which are based on a distance measure, have one key issue: they have difficulty in identifying adjacent clusters with varied densities, regardless of the cluster extraction methods applied on the resultant dendrogram. In this paper, we identify the root cause of this issue and show that the use of a data-dependent kernel (instead of distance or an existing kernel) provides an effective means to address it. We analyse the condition under which existing AHC methods fail to extract clusters effectively, and the reason why the data-dependent kernel is an effective remedy. This leads to a new approach to kernelise existing hierarchical clustering algorithms such as traditional AHC algorithms, HDBSCAN, GDL and PHA. In each of these algorithms, our empirical evaluation shows that a recently introduced Isolation Kernel produces a higher quality or purer dendrogram than distance, Gaussian Kernel and adaptive Gaussian Kernel.
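To make the idea of swapping distance for a data-dependent kernel concrete, here is a minimal sketch under stated assumptions. It estimates Isolation Kernel similarity as the fraction of random partitionings in which two points fall into the same cell, with each partitioning being the Voronoi diagram of `psi` randomly sampled points (one common way to implement Isolation Kernel, assumed here). Converting the similarity to a dissimilarity `1 - K` so that SciPy's `linkage` can consume it is also an assumption of this sketch; the paper defines kernel linkage functions directly. The function name, parameter defaults and synthetic data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, squareform


def isolation_kernel(X, psi=8, t=200, seed=0):
    """Estimate an n-by-n Isolation Kernel similarity matrix (sketch).

    psi : number of sampled points defining each Voronoi partitioning
    t   : number of independent partitionings to average over
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(t):
        centres = X[rng.choice(n, size=psi, replace=False)]
        cell = cdist(X, centres).argmin(axis=1)   # Voronoi cell of each point
        K += cell[:, None] == cell[None, :]       # 1 where two points share a cell
    return K / t


# Synthetic data (hypothetical): a dense cluster next to a sparse one.
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(60, 2))
sparse = rng.normal(3.0, 1.0, size=(60, 2))
X = np.vstack([dense, sparse])

K = isolation_kernel(X)
D = 1.0 - K                       # assumed conversion to a dissimilarity
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Because the sampled centres are denser where the data are denser, cells are small in dense regions and large in sparse ones, so two points a fixed distance apart are less similar in a dense region than in a sparse one. This data dependence is the property the abstract relies on.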

Paper Structure

This paper contains 20 sections, 2 theorems, 18 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

Given two non-overlapping ground-truth clusters $\zeta_i$ and $\zeta_j$ in a dataset, to correctly identify them from the dendrogram produced by the agglomerative clustering algorithm with a kernel linkage function $\hslash$, both clusters must satisfy the following condition: where $\mathbbm{C}_\imath^i$ is the set of subclusters at step $\imath$ of the process in merging members in $\zeta_i$; s

Figures (5)

  • Figure 1: A dendrogram produced by T-AHC with the distance-based single-linkage function on a dataset with four clusters of varied densities. The colours at the bottom of the dendrogram correspond to the true cluster labels of all points shown in Figure (a). The arrow in Figure (b) indicates the subtrees containing points from different clusters.
  • Figure 2: MDS plot using Gaussian Kernel and Isolation Kernel on the dataset in (a).
  • Figure 3: An example dataset of an image
  • Figure 4: Critical difference (CD) diagram of the post-hoc Nemenyi test ($\alpha=0.10$) for dendrogram purity. Two measures are not significantly different if there is a line linking them.
  • Figure 5: Critical difference (CD) diagram of the post-hoc Nemenyi test ($\alpha=0.10$) for $F1$ scores. Two measures are not significantly different if there is a line linking them.
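The critical difference diagrams in Figures 4 and 5 rest on the standard Nemenyi rule: two measures differ significantly when their average ranks differ by at least CD = q_alpha * sqrt(k(k+1) / (6N)), where k is the number of measures compared and N the number of datasets. The sketch below illustrates that rule only. The value of `q_alpha` is the commonly tabulated one for k = 4 at alpha = 0.10 and should be checked against a reference table; the number of datasets and the average ranks are hypothetical placeholders, not results from the paper.

```python
import math


def nemenyi_critical_difference(q_alpha, k, n_datasets):
    """Critical difference of average ranks for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def significantly_different(rank_a, rank_b, cd):
    """True if two average ranks differ by at least the critical difference."""
    return abs(rank_a - rank_b) >= cd


# k = 4 measures (distance, Gaussian Kernel, adaptive Gaussian Kernel, IK).
# q_alpha = 2.291 is the usual tabulated value for k = 4, alpha = 0.10 (assumed).
# n_datasets = 20 is a placeholder, not the paper's dataset count.
cd = nemenyi_critical_difference(q_alpha=2.291, k=4, n_datasets=20)

# Hypothetical average ranks, for illustration only.
far_apart = significantly_different(1.2, 3.0, cd)
close = significantly_different(2.0, 2.5, cd)
print(cd, far_apart, close)
```

In a CD diagram, measures whose average ranks lie within one critical difference of each other are joined by a line, which is what the captions mean by "not significantly different".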

Theorems & Definitions (7)

  • Definition 2
  • Definition 3
  • Definition 4
  • Theorem 1
  • proof
  • Corollary 1
  • Definition 5