Hierarchical clustering with dot products recovers hidden tree structure

Annie Gray, Alexander Modell, Patrick Rubin-Delanchy, Nick Whiteley

TL;DR

This paper offers a new perspective on the well-established agglomerative clustering algorithm, focusing on the recovery of hierarchical structure: clusters are merged by maximum average dot product rather than, for example, by minimum distance or within-cluster variance.

Abstract

In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on recovery of hierarchical structure. We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance. We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model. The key technical innovations are to understand how hierarchical information in this model translates into tree geometry which can be recovered from data, and to characterise the benefits of simultaneously growing sample size and data dimension. We demonstrate superior tree recovery performance with real data over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
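
The merge rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dot_product_linkage`, the naive pairwise search, and the toy data are assumptions made for illustration, and any normalisation or preprocessing used in the paper (for example PCA) is omitted.

```python
import numpy as np

def dot_product_linkage(X):
    """Naive agglomerative clustering that repeatedly merges the two
    clusters with the LARGEST average pairwise dot product.

    Returns a list of (cluster_a, cluster_b, affinity) merges. Leaves are
    numbered 0..n-1 and each merge creates a new id n, n+1, ...
    (Hypothetical helper, written for illustration only.)
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    gram = X @ X.T                      # all pairwise dot products
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        for s in range(len(ids)):
            for t in range(s + 1, len(ids)):
                a, b = ids[s], ids[t]
                # average dot product between members of a and members of b
                aff = gram[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or aff > best[0]:
                    best = (aff, a, b)
        aff, a, b = best
        merges.append((a, b, float(aff)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Toy data (made up): two groups lying along orthogonal directions.
X = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.5]])
merges = dot_product_linkage(X)
for a, b, aff in merges:
    print(f"merge {a} and {b} at average dot product {aff:.2f}")
```

Unlike distance-based linkage, the affinity here decreases as merging proceeds up the tree, so the sequence of merge values plays the role of merge heights in the resulting dendrogram. The search above costs far more than necessary and is meant only to make the rule explicit.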

Paper Structure

This paper contains 36 sections, 9 theorems, 69 equations, 5 figures, 4 tables, 2 algorithms.

Key Result

Lemma 1

For any two vertices $u,v \in \mathcal{V}$,

Figures (5)

  • Figure 1: (a) An example of the dendrogram $\mathcal{D}$ with $\mathcal{V}=\{a,b,c,d,e,f,g\}$, see lemma \ref{lem:merge_height} for interpretation of the merge height $m(\cdot,\cdot)$ and distance $d(\cdot,\cdot)$. Horizontal distances in this diagram are chosen arbitrarily. (b) $\mathcal{D}$ augmented according to $\mathcal{Z}=\{a,b,c,d\}$ and the realization: $Z_1,Z_2,Z_3=a$, $Z_4=b$, $Z_5=Z_6=d$, see section \ref{sec:putting_together} for discussion. (c) The dendrogram $\hat{\mathcal{D}}$ output from algorithm \ref{alg:ip_hc} in the case $\hat{\alpha}=\hat{\alpha}_{\mathrm{data}}$. $\hat{\mathcal{D}}$ can be seen to approximate the dendrogram in (b) and hence $\mathcal{D}$.
  • Figure 2: Simulation study of $\hat{\alpha}_{\text{data}}$ and $\hat{\alpha}_{\text{pca}}$ as estimators of $\alpha$. All three subplots display the maximum error, $\max_{i,j \in [n], i \neq j}|\alpha(Z_i,Z_j)-\hat{\alpha}(i,j)|$, for $\hat{\alpha}=\hat{\alpha}_{\text{data}}$ (black in all subplots (a)-(c)) and $\hat{\alpha}=\hat{\alpha}_{\text{pca}}$ (red). Error bars showing the standard deviation from $100$ simulations are present for all data points, but in some cases are so small they are barely visible.
  • Figure 3: Analysis of the 20 Newsgroups data. Marker shapes correspond to newsgroup classes and marker colours correspond to topics within classes. The first/second columns show results for dot products/Euclidean distances respectively. First row: for each topic ($x$-axis), the affinity/distance ($y$-axis) to the top five best-matching topics, calculated using average linkage of PC scores between documents within topics. Second row: average affinity/distance between documents labelled 'comp.windows.x' and all other topics. Third row: dendrograms output from algorithm \ref{alg:ip_hc} and UPGMA applied to cluster topics.
  • Figure 4: Performance of Algorithm 1 for the 20 Newsgroups data set as a function of number of TF-IDF features, $p$, with $n$ fixed. See table \ref{tbl:results} for numerical values and standard errors.
  • Figure 5: Illustration of how maximising merge height $m(\cdot,\cdot)$ may or may not be equivalent to minimising distance $d(\cdot,\cdot)$, depending on the geometry of the dendrogram. (a) Equivalence holds (b) Equivalence does not hold.
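
The maximum error reported in Figure 2, $\max_{i,j \in [n], i \neq j}|\alpha(Z_i,Z_j)-\hat{\alpha}(i,j)|$, can be computed as in the following sketch. The function name `max_offdiag_error` and the example matrices are assumptions for illustration: `alpha_true[i, j]` stands for $\alpha(Z_i,Z_j)$ and `alpha_hat[i, j]` for whichever estimate $\hat{\alpha}(i,j)$ is being assessed.

```python
import numpy as np

def max_offdiag_error(alpha_true, alpha_hat):
    """Largest absolute entrywise error over pairs i != j.
    (Hypothetical helper; both inputs are n-by-n arrays.)"""
    diff = np.abs(np.asarray(alpha_true, dtype=float)
                  - np.asarray(alpha_hat, dtype=float))
    off_diagonal = ~np.eye(diff.shape[0], dtype=bool)  # exclude i == j
    return float(diff[off_diagonal].max())

# Made-up values: the diagonal discrepancy is ignored by construction.
alpha_true = np.array([[1.0, 0.5], [0.5, 1.0]])
alpha_hat = np.array([[3.0, 0.4], [0.45, 1.0]])
err = max_offdiag_error(alpha_true, alpha_hat)
print(err)
```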

Theorems & Definitions (18)

  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • proof: Proof of lemma \ref{lem:merge_height}
  • Lemma 4
  • proof
  • ...and 8 more