Sublinear Algorithms for Estimating Single-Linkage Clustering Costs

Pan Peng; Christian Sohler; Yi Xu

TL;DR

This work develops sublinear-time algorithms for estimating the costs of single-linkage clustering hierarchies. In the distance setting it defines $\mathrm{cost}_k$ for each $k$-clustering and the total cost $\mathrm{cost}(G)$; in the similarity setting it defines the analogous quantities $\mathrm{cost}_k^{(s)}$ and $\mathrm{cost}^{(s)}(G)$. Building on the approach of Chazelle, Rubinfeld, and Trevisan (CRT) to estimating minimum spanning tree weight, and on reductions to counting connected components, the authors give sampling-based estimators with running time $\tilde O(d\sqrt{W}/\varepsilon^3)$ for distances and $\tilde O(dW/\varepsilon^3)$ for similarities. The estimates are accurate to within a $(1\pm\varepsilon)$ factor on average over the entire cost profile, and a succinct representation of the profile allows individual $\mathrm{cost}_k$ values to be retrieved quickly. The framework extends to metric spaces with constant-time distance queries, with running time $\tilde O(n/\varepsilon^7)$, and nearly matching lower bounds show that the algorithms are close to optimal. The paper also reports experiments on real networks, in which the cost profile is estimated efficiently and accurately for both distance and similarity graphs.
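As background for the reduction to counting connected components, the identity below is the standard one behind CRT-style MST weight estimation; it is given here as a sketch and not necessarily in the exact form the authors use. Assume $G$ is connected with $n$ vertices and integer edge weights in $\{1,\dots,W\}$, and let $c_j$ (notation introduced here for illustration) be the number of connected components of the subgraph that keeps only edges of weight at most $j$, with $c_0 = n$. A minimum spanning tree contains exactly $c_j - 1$ edges of weight greater than $j$, so summing over thresholds gives

$$
\mathrm{MST}(G) \;=\; \sum_{j=0}^{W-1} (c_j - 1) \;=\; n - W + \sum_{j=1}^{W-1} c_j .
$$

Estimating each $c_j$ by sampling vertices and exploring their components locally therefore yields an estimate of the spanning tree weight without reading the whole graph.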

Abstract

Single-linkage clustering is a fundamental method for data analysis. Algorithmically, one can compute a single-linkage $k$-clustering (a partition into $k$ clusters) by computing a minimum spanning tree and dropping the $k-1$ most costly edges. This clustering minimizes the sum of spanning tree weights of the clusters. This motivates us to define the cost of a single-linkage $k$-clustering as the weight of the corresponding spanning forest, denoted by $\mathrm{cost}_k$. Moreover, if we view single-linkage clustering as computing a hierarchy of clusterings, the total cost of the hierarchy is defined as the sum of the costs of the individual clusterings, denoted by $\mathrm{cost}(G) = \sum_{k=1}^{n} \mathrm{cost}_k$. In this paper, we assume that the distances between data points are given as a graph $G$ with average degree $d$ and edge weights from $\{1,\dots, W\}$. Given query access to the adjacency list of $G$, we present a sampling-based algorithm that computes a succinct representation of estimates $\widehat{\mathrm{cost}}_k$ for all $k$. The running time is $\tilde O(d\sqrt{W}/\varepsilon^3)$, and the estimates satisfy $\sum_{k=1}^{n} |\widehat{\mathrm{cost}}_k - \mathrm{cost}_k| \le \varepsilon\cdot \mathrm{cost}(G)$ for any $0<\varepsilon <1$. Thus we can approximate the cost of every $k$-clustering up to a $(1+\varepsilon)$ factor on average. In particular, our result ensures that we can estimate $\mathrm{cost}(G)$ up to a factor of $1\pm \varepsilon$ in the same running time. We also extend our results to the setting where edges represent similarities. In this case, the clusterings are defined by a maximum spanning tree, and our algorithms run in $\tilde{O}(dW/\varepsilon^3)$ time. We further prove nearly matching lower bounds for estimating the total clustering cost, and we extend our algorithms to metric space settings.
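To make the definitions of $\mathrm{cost}_k$ and $\mathrm{cost}(G)$ concrete, here is a minimal sketch that computes the exact cost profile of a small connected graph with Kruskal's algorithm. It is an exact baseline that reads every edge, not the paper's sublinear estimator, and the function name `single_linkage_cost_profile` and the edge-list input format are assumptions made for illustration.

```python
from typing import List, Tuple


def single_linkage_cost_profile(n: int, edges: List[Tuple[int, int, int]]) -> List[int]:
    """Return [cost_1, ..., cost_n] for a connected graph on vertices 0..n-1.

    `edges` holds (u, v, weight) triples. Exact baseline for illustration only.
    """
    parent = list(range(n))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal: scan edges by increasing weight, keep those joining two components.
    mst_weights: List[int] = []
    for w, u, v in sorted((w, u, v) for u, v, w in edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst_weights.append(w)  # appended in nondecreasing order
    if len(mst_weights) != n - 1:
        raise ValueError("graph must be connected")

    # prefix[i] = total weight of the i lightest MST edges.
    prefix = [0]
    for w in mst_weights:
        prefix.append(prefix[-1] + w)

    # Dropping the k-1 heaviest MST edges keeps the n-k lightest ones.
    return [prefix[n - k] for k in range(1, n + 1)]


# Hypothetical 4-vertex example: the MST edge weights are 1, 2, 3.
example_edges = [(0, 1, 1), (1, 2, 3), (2, 3, 2), (0, 3, 5)]
profile = single_linkage_cost_profile(4, example_edges)
total_cost = sum(profile)  # cost(G) = sum over k of cost_k
```

If the MST edge weights are sorted as $w_1 \le \dots \le w_{n-1}$, then $\mathrm{cost}_k = \sum_{i=1}^{n-k} w_i$, and exchanging the order of summation gives $\mathrm{cost}(G) = \sum_{i=1}^{n-1} (n-i)\, w_i$, so lighter tree edges contribute to more levels of the hierarchy.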
