Chamfer-Linkage for Hierarchical Agglomerative Clustering
Kishen N Gowda, Willem Fletcher, MohammadHossein Bateni, Laxman Dhulipala, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki
TL;DR
Chamfer-linkage is a linkage function for HAC built on an asymmetric cluster distance $Ch(A,B)=\sum_{a \in A} \min_{b \in B} d(a,b)$, which the authors argue better captures concept representation and promotes balanced dendrograms. The authors give an exact $O(n^2)$-time HAC algorithm within the HAC-NN framework and provide a space–time trade-off that reduces memory to $O(n^2/t)$ at the cost of $O(n^2 t)$ time, with $t\in[1,n]$. Empirically, Chamfer-linkage outperforms classical linkages (e.g., average, Ward) on 19 diverse datasets in ARI and NMI while producing well-balanced hierarchies, and the authors release optimized C++ implementations and Python bindings. The work positions Chamfer-linkage as a practical drop-in replacement for traditional linkages and motivates scalable parallel and approximate variants for very large datasets.
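To make the definition concrete, here is a minimal sketch of the directed Chamfer distance between two small point sets. It assumes Euclidean distance for $d$ and uses NumPy; the function name `chamfer` is illustrative and is not taken from the authors' released code.

```python
import numpy as np

def chamfer(A, B):
    """Directed Chamfer distance: sum over a in A of min over b in B of d(a, b).

    Assumption: d is the Euclidean distance.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise Euclidean distances, shape (|A|, |B|).
    diff = A[:, None, :] - B[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    # For each point of A take its nearest point in B, then sum.
    return dists.min(axis=1).sum()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0]])

print(chamfer(A, B))  # 1.0: the point (1, 0) is at distance 1 from B
print(chamfer(B, A))  # 0.0: the only point of B also lies in A
```

The two calls return different values, which illustrates the asymmetry: every point of B is covered by A, but not every point of A is covered by B.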
Abstract
Hierarchical Agglomerative Clustering (HAC) is a widely used clustering method based on repeatedly merging the closest pair of clusters, where inter-cluster distances are determined by a linkage function. Unlike many clustering methods, HAC does not optimize a single explicit global objective; clustering quality is therefore primarily evaluated empirically, and the choice of linkage function plays a crucial role in practice. However, popular classical linkages, such as single-linkage, average-linkage, and Ward's method, show high variability across real-world datasets and do not consistently produce high-quality clusterings. In this paper, we propose Chamfer-linkage, a novel linkage function that measures the distance between clusters using the Chamfer distance, a popular notion of distance between point clouds in machine learning and computer vision. We argue that Chamfer-linkage satisfies desirable concept representation properties that other popular measures struggle to satisfy. Theoretically, we show that Chamfer-linkage HAC can be implemented in $O(n^2)$ time, matching the efficiency of classical linkage functions. Experimentally, we find that Chamfer-linkage consistently yields higher-quality clusterings than classical linkages such as average-linkage and Ward's method across a diverse collection of datasets. Our results establish Chamfer-linkage as a practical drop-in replacement for classical linkage functions, broadening the toolkit for hierarchical clustering in both theory and practice.
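The merge loop described in the abstract can be sketched naively as follows. This is not the $O(n^2)$ algorithm from the paper: it recomputes every cluster-to-cluster distance at each step and is meant only to show the mechanics. Two assumptions are made here that the abstract does not confirm: $d$ is Euclidean, and at each step the ordered pair $(A, B)$ with the smallest directed Chamfer distance is merged. The names `chamfer_from_matrix` and `naive_chamfer_hac` are hypothetical.

```python
import numpy as np

def chamfer_from_matrix(D, A, B):
    """Directed Chamfer distance between index lists A and B,
    read off a precomputed point-to-point distance matrix D."""
    return D[np.ix_(A, B)].min(axis=1).sum()

def naive_chamfer_hac(X):
    """Naive HAC with a Chamfer-style linkage (illustrative only).

    Assumptions: Euclidean d; merge the ordered pair (A, B) that
    minimizes the directed distance. Returns a list of merges
    (id_a, id_b, distance); new clusters get ids n, n+1, ...
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # All pairwise Euclidean distances between input points.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = {i: [i] for i in range(n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        best = None
        # Scan every ordered pair of distinct clusters.
        for i, A in clusters.items():
            for j, B in clusters.items():
                if i == j:
                    continue
                c = chamfer_from_matrix(D, A, B)
                if best is None or c < best[0]:
                    best = (c, i, j)
        c, i, j = best
        # Merge the chosen pair into a new cluster.
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j, float(c)))
        next_id += 1
    return merges

# Four points on a line: two tight pairs far apart.
X = np.array([[0.0], [1.0], [10.0], [12.0]])
merges = naive_chamfer_hac(X)
print(merges)  # [(0, 1, 1.0), (2, 3, 2.0), (4, 5, 19.0)]
```

On this toy input the two tight pairs merge first, and the final merge joins the two resulting clusters, giving a balanced hierarchy.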
