Sublinear Algorithms for Estimating Single-Linkage Clustering Costs

Pan Peng, Christian Sohler, Yi Xu

TL;DR

This work develops sublinear-time algorithms to estimate the costs of single-linkage clustering hierarchies. It defines $\mathrm{cost}_k$ for $k$-clusterings and the total cost $\mathrm{cost}(G)$ in the distance setting, and $\mathrm{cost}_k^{(s)}$ and $\mathrm{cost}^{(s)}(G)$ in the similarity setting. Leveraging the CRT approach to MST weight estimation and reductions to counting connected components, the authors deliver sampling-based estimators with running times $\tilde O(d\sqrt{W}/\varepsilon^3)$ for distance and $\tilde O(dW/\varepsilon^3)$ for similarity. The estimators achieve $(1\pm\varepsilon)$ relative accuracy on average over the entire cost profile, and a succinct profile representation allows fast retrieval of individual $\mathrm{cost}_k$ values. The framework extends to metric spaces with constant-time distance queries, yielding $\tilde O(n/\varepsilon^7)$ running times, and nearly matching lower bounds show that the sublinear bounds are close to optimal. The paper also reports experiments on real networks, demonstrating practical efficiency and accurate cost-profile estimation for both distance and similarity graphs. Altogether, this work advances sublinear tools for hierarchical clustering by enabling accurate cost estimation and compact hierarchy summaries in large graphs.
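
The CRT approach (Chazelle, Rubinfeld, Trevisan) mentioned above rests on an identity that expresses the MST weight of a connected graph with integer weights in $\{1,\dots,W\}$ through component counts: $M(G) = n - W + \sum_{i=1}^{W-1} c^{(i)}$, where $c^{(i)}$ is the number of connected components of the subgraph containing only edges of weight at most $i$. The sketch below checks this identity on a toy graph. It counts components exactly with union-find, whereas a sublinear algorithm would estimate each $c^{(i)}$ by sampling; the function names and the example graph are illustrative assumptions, not taken from the paper.

```python
def count_components(n, edges):
    """Exact number of connected components on vertices 0..n-1 (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components


def mst_weight_via_components(n, weighted_edges, W):
    """MST weight of a connected graph via M(G) = n - W + sum_{i<W} c^(i).

    weighted_edges: list of (u, v, w) with integer w in {1, ..., W}.
    Exact counting is used here; a sublinear algorithm would replace
    count_components with a sampling-based estimate.
    """
    total = n - W
    for i in range(1, W):
        threshold_edges = [(u, v) for u, v, w in weighted_edges if w <= i]
        total += count_components(n, threshold_edges)
    return total


# Hypothetical toy graph: 4 vertices, weights in {1, ..., 5}.
toy_edges = [(0, 1, 1), (1, 2, 2), (2, 3, 4), (0, 3, 3), (0, 2, 5)]
print(mst_weight_via_components(4, toy_edges, 5))  # 6
```

On this graph the component counts are $c^{(1)}=3$, $c^{(2)}=2$, $c^{(3)}=1$, $c^{(4)}=1$, so the identity gives $4 - 5 + 7 = 6$, which matches the MST with edge weights 1, 2, 3.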

Abstract

Single-linkage clustering is a fundamental method for data analysis. Algorithmically, one can compute a single-linkage $k$-clustering (a partition into $k$ clusters) by computing a minimum spanning tree and dropping the $k-1$ most costly edges. This clustering minimizes the sum of spanning tree weights of the clusters. This motivates us to define the cost of a single-linkage $k$-clustering as the weight of the corresponding spanning forest, denoted by $\mathrm{cost}_k$. Moreover, if we consider single-linkage clustering as computing a hierarchy of clusterings, the total cost of the hierarchy is defined as the sum of the costs of the individual clusterings, denoted by $\mathrm{cost}(G) = \sum_{k=1}^{n} \mathrm{cost}_k$. In this paper, we assume that the distances between data points are given as a graph $G$ with average degree $d$ and edge weights from $\{1,\dots, W\}$. Given query access to the adjacency list of $G$, we present a sampling-based algorithm that computes a succinct representation of estimates $\widehat{\mathrm{cost}}_k$ for all $k$. The running time is $\tilde O(d\sqrt{W}/\varepsilon^3)$, and the estimates satisfy $\sum_{k=1}^{n} |\widehat{\mathrm{cost}}_k - \mathrm{cost}_k| \le \varepsilon\cdot \mathrm{cost}(G)$, for any $0<\varepsilon <1$. Thus we can approximate the cost of every $k$-clustering up to a $(1+\varepsilon)$ factor on average. In particular, our result ensures that we can estimate $\mathrm{cost}(G)$ up to a factor of $1\pm \varepsilon$ in the same running time. We also extend our results to the setting where edges represent similarities. In this case, the clusterings are defined by a maximum spanning tree, and our algorithms run in $\tilde{O}(dW/\varepsilon^3)$ time. We further prove nearly matching lower bounds for estimating the total clustering cost, and we extend our algorithms to metric space settings.
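
As a point of reference for the definitions above, the following minimal sketch computes the exact cost profile by brute force: it runs Kruskal's algorithm, sorts nothing further (Kruskal already yields the MST edges in non-decreasing weight order $w_1 \le \dots \le w_{n-1}$), and takes prefix sums, since $\mathrm{cost}_k = \sum_{i=1}^{n-k} w_i$ and hence $\mathrm{cost}(G) = \sum_{i=1}^{n-1} (n-i)\, w_i$. This is the linear-time baseline that the sublinear algorithm approximates, not the paper's algorithm; the function name and example graph are illustrative assumptions.

```python
def single_linkage_costs(n, weighted_edges):
    """Return [cost_1, ..., cost_n] for a connected graph on vertices 0..n-1.

    weighted_edges: list of (u, v, w) with positive distances w.
    cost_k is the weight of the MST after dropping its k-1 heaviest edges.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Kruskal: accepted edges arrive in non-decreasing weight order.
    mst_weights = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst_weights.append(w)
    if len(mst_weights) != n - 1:
        raise ValueError("graph must be connected")

    # prefix[j] = total weight of the j cheapest MST edges.
    prefix = [0]
    for w in mst_weights:
        prefix.append(prefix[-1] + w)

    # A k-clustering keeps the n-k cheapest MST edges.
    return [prefix[n - k] for k in range(1, n + 1)]


# Hypothetical toy graph: 4 vertices, MST edge weights 1, 2, 3.
toy_edges = [(0, 1, 1), (1, 2, 2), (2, 3, 4), (0, 3, 3), (0, 2, 5)]
profile = single_linkage_costs(4, toy_edges)
print(profile)       # [6, 3, 1, 0]
print(sum(profile))  # 10, i.e. cost(G) = 3*1 + 2*2 + 1*3
```

Note that $\mathrm{cost}_1$ is the MST weight and $\mathrm{cost}_n = 0$, because the clustering into singletons keeps no edges.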

Paper Structure

This paper contains 69 sections, 45 theorems, 157 equations, 14 figures, 3 tables, 12 algorithms.

Key Result

Theorem 1.1

Let $G$ be a weighted graph with edge weights in $\{1,\dots, W\}$ and average (unweighted) degree $d$. Assume that $\sqrt{W}\leq n$ and let $0< \varepsilon<1$ be a parameter. alg:appcost outputs an estimate $\widehat{\mathrm{cost}}(G)$ of the single-linkage clustering cost $\mathrm{cost}(G)$ in the [...] The query complexity and running time of the algorithm are $O(\frac{\sqrt{W}d}{\varepsilon^3}\log^4 \cdots)$ [...]

Figures (14)

  • Figure 1: Approximation ratio and normalized profiles
  • Figure 2: Datasets speed up
  • Figure 3: Approximation ratio in distance graphs: road networks
  • Figure 4: Approximation ratio in distance graphs: road networks
  • Figure 5: Approximation ratio in distance graphs: road networks
  • ...and 9 more figures

Theorems & Definitions (82)

  • Theorem 1.1
  • Theorem 1.2
  • Theorem 1.3
  • Theorem 1.4
  • Theorem 1.5
  • Theorem 1.6
  • Theorem 1.7
  • Theorem 2.1: The Chernoff--Hoeffding bound
  • Lemma 3.1
  • Lemma 3.2: chazelle2005approximating
  • ...and 72 more