Hierarchical Clustering With Confidence

Di Wu, Jacob Bien, Snigdha Panigrahi

Abstract

Agglomerative hierarchical clustering is one of the most widely used approaches for exploring how observations in a dataset relate to each other. However, its greedy nature makes it highly sensitive to small perturbations in the data, often producing different clustering results and making it difficult to separate genuine structure from spurious patterns. In this paper, we show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures based on the clustering results. We propose a simple randomization scheme together with a method for constructing a valid p-value at each node of the hierarchical clustering dendrogram that quantifies evidence against performing the greedy merge. Our test controls the Type I error rate, works with any hierarchical linkage without case-specific derivations, and simulations show it is substantially more powerful than existing selective inference approaches. To demonstrate the practical utility of our p-values, we develop an adaptive $\alpha$-spending procedure that estimates the number of clusters, with a probabilistic guarantee on overestimation. Experiments on simulated and real data show that this estimate yields powerful clustering and can be used, for example, to assess clustering stability across multiple runs of the randomized algorithm.
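To make the "greedy merge" concrete, the following is a minimal sketch of traditional (non-randomized) agglomerative clustering with complete linkage, written in plain Python. It is not the randomized scheme proposed in the paper, whose details are not given in this summary. The function names `complete_linkage` and `greedy_merges` and the toy points are illustrative assumptions.

```python
import math
from itertools import combinations


def complete_linkage(a, b, points):
    """Complete-linkage value: largest pairwise distance between clusters a and b."""
    return max(math.dist(points[i], points[j]) for i in a for j in b)


def greedy_merges(points, steps):
    """Run `steps` greedy merges; return the merge sequence and the remaining clusters."""
    # Start with every observation in its own cluster (tuples of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    for _ in range(steps):
        # Greedy step: merge the pair of clusters with the smallest linkage value.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: complete_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b))
    return merges, clusters


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.15, 5.0), (5.0, 5.3)]
merges, clusters = greedy_merges(points, steps=4)
```

Because each step commits to the single smallest linkage value, a small change in the data that reorders two nearly tied candidates can change every later merge, which is the sensitivity the abstract describes. In practice a library routine such as `scipy.cluster.hierarchy.linkage` would usually replace this quadratic-per-step loop.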

Paper Structure

This paper contains 39 sections, 16 theorems, 81 equations, 11 figures, 1 table, 2 algorithms.

Key Result

Proposition 1

Let $\Omega_o^{*(t)}$ be the set of possible merge sequences from running traditional hierarchical clustering (Eq. \ref{eq:trad_hclust}) for $t$ steps on a fixed data matrix $X_o$. Then the randomized algorithm's random merge sequence $\overline{M}^{(t)}$, when presented with the same data, satisfies the following:

Figures (11)

  • Figure 1: (a): A two-dimensional example with $n=30$ points across two true clusters; (b): a dendrogram resulting from complete-linkage hierarchical clustering of the example data; (c): a dendrogram resulting from complete-linkage randomized hierarchical clustering of the example data (using a randomization parameter $\tau^* = 0.10$, defined later in Section \ref{sec:algorithm}); (d): a histogram for the estimated number of clusters $\hat{K}$ based on the proposed $\alpha$-spending procedure.
  • Figure 2: Comparison of clustering quality metrics under varying levels of randomization $\tau^*$. (Left): Boxplots of the ratio between within-cluster sum of squares (WCSS) and total sum of squares (TSS), showing how cluster compactness changes with $\tau^*$. (Right): Boxplots of the Adjusted Rand Index (ARI), measuring agreement with the true clustering, which declines as $\tau^*$ increases.
  • Figure 3: Comparison of the p-value ECDFs and Type I error rates simulated under the null hypothesis. (Left) ECDF plots for the proposed and baseline methods. (Right) Boxplots of the Type I error rates across different methods.
  • Figure 4: Empirical power curves as a function of effect size for the proposed randomized method with $\tau^* = 0.10$, compared with the two selective inference approaches under varying choices of linkage function (complete, single, average and minimax) and true number of clusters $K=2$.
  • Figure 5: Paired histograms of the $\widehat{K}$ values selected by our proposed method and the gap statistic across varying values of $\delta$. As $\delta$ increases, our method consistently recovers the true number of clusters, while the gap statistic remains overly conservative, estimating $\widehat{K}=1$.
  • ...and 6 more figures
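The two quality metrics named in the Figure 2 caption can be computed as sketched below, using their standard textbook definitions in plain Python. This is an illustration only: the function names and the example inputs are assumptions, and the paper's own evaluation code is not shown in this summary.

```python
from collections import Counter
from math import comb


def wcss_tss_ratio(points, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    dim = len(points[0])

    def sq_dev(rows):
        # Sum of squared deviations of the rows from their own centroid.
        centroid = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        return sum((r[d] - centroid[d]) ** 2 for r in rows for d in range(dim))

    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    wcss = sum(sq_dev(rows) for rows in groups.values())
    return wcss / sq_dev(points)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same observations."""
    n = len(labels_a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both labelings are a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical example: four points forming two tight pairs.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
ratio = wcss_tss_ratio(points, [0, 0, 1, 1])
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_diff = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A smaller WCSS/TSS ratio indicates more compact clusters, while an ARI of 1 indicates perfect agreement with the reference labels regardless of how the clusters are named.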

Theorems & Definitions (32)

  • Proposition 1
  • Lemma 1
  • Lemma 2: Based on yun2023selective
  • Theorem 1
  • Corollary 1
  • proof
  • Theorem 2
  • Proposition 2
  • Theorem 3
  • Lemma 3
  • ...and 22 more