Chamfer-Linkage for Hierarchical Agglomerative Clustering

Kishen N Gowda, Willem Fletcher, MohammadHossein Bateni, Laxman Dhulipala, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki

TL;DR

Chamfer-linkage is a linkage function for HAC built on the asymmetric cluster distance $\mathtt{Ch}(A,B)=\sum_{a\in A}\min_{b\in B} d(a,b)$, which the authors argue better captures concept representation and promotes balanced dendrograms. The authors give an exact $O(n^2)$-time HAC algorithm within the HAC-NN framework and a space–time trade-off that reduces memory to $O(n^2/t)$ at the cost of $O(n^2 t)$ time, for $t\in[1,n]$. Empirically, Chamfer-linkage outperforms classical linkages (e.g., average, Ward) on 19 diverse datasets in ARI and NMI while producing well-balanced hierarchies, and the authors release optimized C++ implementations and Python bindings. The work positions Chamfer-linkage as a practical drop-in replacement for traditional linkages and motivates scalable parallel and approximate variants for very large datasets.
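
The cluster distance in the TL;DR can be written directly from its formula. The snippet below is a minimal sketch, not the authors' released implementation: it assumes points are coordinate tuples and that $d$ is the Euclidean distance, and the function name `chamfer` is chosen here for illustration only.

```python
import math

def chamfer(A, B):
    """Directed Chamfer distance: for each a in A, add the distance to a's nearest point in B."""
    # ASSUMPTION: d is Euclidean distance; the source does not fix a particular metric here.
    return sum(min(math.dist(a, b) for b in B) for a in A)

A = [(0.0, 0.0), (10.0, 0.0)]
B = [(0.0, 0.0)]

print(chamfer(A, B))  # 10.0: (0, 0) matches exactly, (10, 0) is 10 away from B's only point
print(chamfer(B, A))  # 0.0: B's only point coincides with a point of A
```

The two printed values differ, which is the asymmetry the TL;DR refers to: every point of the first argument must be close to the second cluster, but not the other way around.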

Abstract

Hierarchical Agglomerative Clustering (HAC) is a widely used clustering method based on repeatedly merging the closest pair of clusters, where inter-cluster distances are determined by a linkage function. Unlike many clustering methods, HAC does not optimize a single explicit global objective; clustering quality is therefore primarily evaluated empirically, and the choice of linkage function plays a crucial role in practice. However, popular classical linkages, such as single-linkage, average-linkage, and Ward's method, show high variability across real-world datasets and do not consistently produce high-quality clusterings in practice. In this paper, we propose Chamfer-linkage, a novel linkage function that measures the distance between clusters using the Chamfer distance, a popular notion of distance between point clouds in machine learning and computer vision. We argue that Chamfer-linkage satisfies desirable concept representation properties that other popular measures struggle to satisfy. Theoretically, we show that Chamfer-linkage HAC can be implemented in $O(n^2)$ time, matching the efficiency of classical linkage functions. Experimentally, we find that Chamfer-linkage consistently yields higher-quality clusterings than classical linkages such as average-linkage and Ward's method across a diverse collection of datasets. Our results establish Chamfer-linkage as a practical drop-in replacement for classical linkage functions, broadening the toolkit for hierarchical clustering in both theory and practice.
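
The merge loop the abstract describes (repeatedly merging the closest pair of clusters under a linkage function) can be illustrated with a deliberately naive baseline. This is a sketch under stated assumptions, not the paper's $O(n^2)$ algorithm: it recomputes every linkage value at every step, it assumes Euclidean distance, and, because the Chamfer distance is asymmetric, it assumes the merge rule picks the ordered pair $(A, B)$ with the smallest $\mathtt{Ch}(A,B)$. The abstract does not state how the asymmetry is resolved, and the name `naive_chamfer_hac` is hypothetical.

```python
import math

def chamfer(A, B):
    """Directed Chamfer distance (Euclidean d assumed)."""
    return sum(min(math.dist(a, b) for b in B) for a in A)

def naive_chamfer_hac(points):
    """Greedy HAC baseline; returns merges as (cluster, cluster, linkage value) triples.

    ASSUMPTION: each step merges the ordered pair (A, B) minimizing Ch(A, B).
    This brute-force loop is far slower than the O(n^2) bound claimed in the paper.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None  # (value, i, j) for the best ordered pair seen so far
        for i, A in enumerate(clusters):
            for j, B in enumerate(clusters):
                if i != j:
                    value = chamfer(A, B)
                    if best is None or value < best[0]:
                        best = (value, i, j)
        value, i, j = best
        merges.append((clusters[i], clusters[j], value))
        merged = clusters[i] + clusters[j]
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0)]
for left, right, value in naive_chamfer_hac(points):
    print(left, right, value)
```

On these four collinear points the two tight pairs merge first, and the final merge joins the two pairs; the list of merges plays the role of the dendrogram.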

Paper Structure (14 sections, 3 theorems, 7 equations, 3 figures, 7 tables, 2 algorithms)


Key Result

Theorem 1

The HAC-NN algorithm, when instantiated with the Chamfer-linkage-specific Merge subroutine, correctly computes the Chamfer-linkage dendrogram in $O(n^2)$ time and space.
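
To relate this bound to the space–time trade-off quoted in the TL;DR (memory $O(n^2/t)$ at the cost of $O(n^2 t)$ time, with $t\in[1,n]$), it may help to plug in a few values of $t$. The worked substitution below uses only those two expressions; the intermediate choice $t=\sqrt{n}$ is an illustrative pick, not a setting taken from the paper.

```latex
\begin{align*}
t = 1        &: \quad \text{space } O(n^2/1) = O(n^2),            & \text{time } O(n^2 \cdot 1) = O(n^2), \\
t = \sqrt{n} &: \quad \text{space } O(n^2/\sqrt{n}) = O(n^{1.5}), & \text{time } O(n^2 \sqrt{n}) = O(n^{2.5}), \\
t = n        &: \quad \text{space } O(n^2/n) = O(n),              & \text{time } O(n^2 \cdot n) = O(n^3).
\end{align*}
```

At $t=1$ the expressions give bounds of the same order as Theorem 1, while $t=n$ trades linear memory for cubic time.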

Figures (3)

  • Figure 1: An example of HAC.
  • Figure 2: Non-reducibility: Consider six points equally spaced along a circle, partitioned into three clusters (red, green, and purple in the figure). When the red and green clusters merge, they create a blue cluster. But $\mathtt{Ch}(\text{purple}, \text{blue}) < \min\{ \mathtt{Ch}(\text{purple}, \text{red}), \mathtt{Ch}(\text{purple}, \text{green}) \}$, since both vertices of the purple cluster have a neighbor (along the circle) in the blue cluster.
  • Figure 3: Demonstration of the observation labeled obs:min-monotone.
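
The non-reducibility example in the Figure 2 caption can be checked numerically. The sketch below assumes Euclidean distance on the unit circle and one partition consistent with the caption, namely three pairs of adjacent points; the figure itself is not reproduced here, so the exact colour assignment is an assumption.

```python
import math

def chamfer(A, B):
    """Directed Chamfer distance (Euclidean d assumed)."""
    return sum(min(math.dist(a, b) for b in B) for a in A)

# Six points equally spaced on the unit circle, indexed 0..5 going around.
pts = [(math.cos(math.pi * k / 3), math.sin(math.pi * k / 3)) for k in range(6)]

# ASSUMPTION: each cluster is a pair of adjacent points; the colour labels are hypothetical.
purple = [pts[0], pts[1]]
red = [pts[2], pts[3]]
green = [pts[4], pts[5]]
blue = red + green  # result of merging red and green

print("Ch(purple, red)   =", chamfer(purple, red))
print("Ch(purple, green) =", chamfer(purple, green))
print("Ch(purple, blue)  =", chamfer(purple, blue))
```

With this partition, each purple point is adjacent to only one of red or green, so one purple point pays a chord of length $\sqrt{3}$ in either case and both directed distances are $1+\sqrt{3}\approx 2.73$. After the merge, both purple points have an adjacent neighbor in blue, so the distance drops to about $2$, below the minimum of the two pre-merge values.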

Theorems & Definitions (4)

  • Definition 1: Chamfer Distance
  • Theorem 1
  • Theorem 2
  • Theorem 3