Approximate Tree Completion and Learning-Augmented Algorithms for Metric Minimum Spanning Trees

Nate Veldt, Thomas Stanley, Benjamin W. Priest, Trevor Steil, Keita Iwabuchi, T. S. Jayram, Geoffrey Sanders

TL;DR

The paper addresses the cost of computing minimum spanning trees in arbitrary metric spaces, where naive approaches require $\Omega(n^2)$ distance queries. It introduces Metric Forest Completion (MFC), which starts from an initial forest and completes it into a spanning tree by computing an MST on a coarsened graph with one node per component; the main algorithm, MFC-Approx, runs in subquadratic time, achieves an approximation factor below $2.62$, and supports a learning-augmented guarantee that ties performance to the initial forest's overlap with an optimal MST. The key theoretical contribution is a bound of the form $w_{\mathcal{X}}(\hat{T}) \le \beta\, w^{*}(T^{*}_{\mathcal{P}}) + (\beta/(\beta-1)+1)\sum_i w_{\mathcal{X}}(T_i)$ with $\beta=(3+\sqrt{5})/2$, plus a refinement based on $\gamma$-overlap that yields a $(2\gamma+1)$-approximation in subquadratic time when the initial forest has $t = o(n)$ components. Empirically, the framework runs orders of magnitude faster than exact algorithms while producing near-optimal spanning trees across diverse metrics and data regimes, especially when clustering structure is present.
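
As a quick check on the constants in the bound above, the following worked calculation uses only the stated value of $\beta$ (the summation index $i$ over components is an assumption about the notation). Since $\beta$ is the larger root of $x^2 - 3x + 1 = 0$, the two coefficients in the bound coincide:

$$
\beta = \frac{3+\sqrt{5}}{2}
\;\Longrightarrow\;
\beta^2 - 3\beta + 1 = 0
\;\Longrightarrow\;
(\beta - 1)^2 = \beta
\;\Longrightarrow\;
\frac{\beta}{\beta-1} + 1 = (\beta - 1) + 1 = \beta .
$$

The bound therefore reads $w_{\mathcal{X}}(\hat{T}) \le \beta \left( w^{*}(T^{*}_{\mathcal{P}}) + \sum_i w_{\mathcal{X}}(T_i) \right)$ with $\beta \approx 2.618 < 2.62$. If $w^{*}(T^{*}_{\mathcal{P}})$ is read as the weight of the optimal completion set, the parenthesized quantity is the weight of the optimal completed tree, which matches the stated approximation factor.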

Abstract

Finding a minimum spanning tree (MST) for $n$ points in an arbitrary metric space is a fundamental primitive for hierarchical clustering and many other ML tasks, but this takes $\Omega(n^2)$ time to even approximate. We introduce a framework for metric MSTs that first (1) finds a forest of disconnected components using practical heuristics, and then (2) finds a small-weight set of edges to connect disjoint components of the forest into a spanning tree. We prove that optimally solving the second step still takes $\Omega(n^2)$ time, but we provide a subquadratic 2.62-approximation algorithm. In the spirit of learning-augmented algorithms, we then show that if the forest found in step (1) overlaps with an optimal MST, we can approximate the original MST problem in subquadratic time, where the approximation factor depends on a measure of overlap. In practice, we find nearly optimal spanning trees for a wide range of metrics, while being orders of magnitude faster than exact algorithms.
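
For context on the quadratic cost mentioned in the abstract, here is a minimal sketch of a naive exact baseline: Prim's algorithm run on the implicit complete graph. It is an illustration only, not the authors' implementation; the function name `naive_metric_mst` and the default Euclidean metric `math.dist` are assumptions made for this example. The sketch counts distance evaluations to make the $n(n-1)/2$ query cost visible.

```python
import math


def naive_metric_mst(points, dist=math.dist):
    """Prim's algorithm on the implicit complete graph over `points`.

    Returns (edges, total_weight, num_distance_queries).
    `dist` can be any metric; Euclidean distance is only a default.
    """
    n = len(points)
    if n == 0:
        return [], 0.0, 0

    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection to the tree
    parent = [-1] * n          # tree endpoint realizing best[v]
    edges, total, queries = [], 0.0, 0

    u = 0                      # start the tree at an arbitrary point
    in_tree[u] = True
    for _ in range(n - 1):
        # Relax every point outside the tree against the newest tree point.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                queries += 1
                if d < best[v]:
                    best[v], parent[v] = d, u
        # Attach the outside point with the cheapest connection.
        u = min((v for v in range(n) if not in_tree[v]), key=best.__getitem__)
        in_tree[u] = True
        edges.append((parent[u], u))
        total += best[u]
    return edges, total, queries
```

Every pair of points is compared exactly once, so the query count is $n(n-1)/2$ regardless of the input, which is the cost the MFC framework is designed to avoid.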

Paper Structure

This paper contains 34 sections, 4 theorems, 30 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Theorem 3.1

Every optimal algorithm for MFC has $\Omega(n^2)$ query complexity. Furthermore, for any multiplicative factor $p \geq 1$ (not necessarily a constant), any algorithm that finds a set $M \subseteq \mathcal{I}$ that is feasible for eq:mfc and satisfies $w_\mathcal{X}(M) \leq p \cdot w_{\mathcal{X}}(M^

Figures (6)

  • Figure 1: (a) Consider a simple example of a finite metric space $(\mathcal{X},d)$: 75 points in $\mathbb{R}^2$ equipped with Euclidean distance. $G_\mathcal{X}$ is an implicit complete graph obtained by computing distances between all pairs of points. (b) Kruskal's algorithm iteratively merges components in a growing forest. Here we display the forest at an intermediate step. Knowing the minimum distance between two different components requires solving a bichromatic closest pair problem. (c) Continuing until all components are merged produces a metric minimum spanning tree.
  • Figure 2: (a) We display an optimal metric MST for a toy example with $|\mathcal{X}| = 75$ points. Our framework and algorithm apply to general metric spaces, but for visualization purposes our figures focus on 2-dimensional Euclidean space. (b) The Metric Forest Completion problem takes as input a partitioning $\mathcal{P}$ and spanning trees $\{T_i\}$ for the components of the partition. For this illustration we used a $k$-means algorithm with $k = 5$ and computed optimal spanning trees of the components using the naive approach. (c) The true MST overlaps significantly with the initial partial spanning tree, but its induced subgraph on each component is not necessarily connected. For this example, the $\gamma$-overlap (see the section on learning-augmented algorithms) is $\gamma \leq 1.12$. (d) The optimal completion set $M^*$ is shown in orange; combining it with the spanning trees of the initial forest produces a spanning tree for all of $\mathcal{X}$. (e) The coarsened graph $G_\mathcal{P}$ has a node $v_i$ for each component $P_i \in \mathcal{P}$. Solving $O(t^2)$ bichromatic closest pair problems identifies the closest pair of points between each pair of clusters, defining an optimal weight function $w^*$ on $G_\mathcal{P}$. (f) Finding the minimum-weight completion set $M^*$ amounts to finding the MST of $G_\mathcal{P}$ with respect to weight function $w^*$.
  • Figure 3: (a) Finding the minimum distance between components $P_i$ and $P_j$ (dashed line) is an expensive bichromatic closest pair problem. MFC-Approx instead performs a cheaper nearest neighbor query for a representative point in each component ($s_i$ and $s_j$, shown as stars). The algorithm finds the closest point to each representative from the opposite cluster, then takes the minimum of the two distances. (b) Applying this to each pair of components produces a weight function $\hat{w}$ for the coarsened graph $G_\mathcal{P}$. Finding an MST of $G_\mathcal{P}$ with respect to $\hat{w}$ yields (c) a 2.62-approximation for MFC.
  • Figure 4: Results on synthetic uniform random data for dimensions $d \in \{4, 8, 16, 32, 256\}$. Each point in each plot represents an average over $16$ sampled point clouds for a fixed $n$ and choice of component number $t$. Runtime ratio is the runtime of the optimal MST algorithm divided by the runtime of our MFC framework (including initial forest generation). Cost ratio is the ratio between the spanning tree weight for our method and the optimal MST weight. The $\gamma$ upper bound is computed from the overlap between the initial forest and the one optimal MST that was computed.
  • Figure 5: Results for Fashion-MNIST. Each point is the average of 16 samples for fixed $n$ and $t$.
  • ...and 1 more figure
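
The MFC-Approx procedure described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `mfc_approx` and `nearest` are hypothetical, the representative of each component is arbitrarily taken to be its first listed point, nearest neighbors are found by a brute-force scan, and Euclidean `math.dist` is only a default metric. For each pair of components the sketch queries each representative against the opposite component, keeps the smaller of the two distances as the weight $\hat{w}$, and then runs Kruskal's algorithm on the coarsened graph.

```python
import math
from itertools import combinations


def nearest(query, component, points, dist):
    """Return (distance, index) of the point in `component` closest to `query`."""
    return min((dist(points[query], points[j]), j) for j in component)


def mfc_approx(points, components, dist=math.dist):
    """Sketch of MFC-Approx.

    points:     list of points in the metric space
    components: list of lists of point indices (the partition P)
    Returns a list of completion edges (u, v, weight) joining the components.
    """
    # Assumption: the representative s_i is the first point of each component.
    reps = [comp[0] for comp in components]

    # Build the approximate weight function w_hat on the coarsened graph.
    candidates = []
    for i, j in combinations(range(len(components)), 2):
        d_ij, q = nearest(reps[i], components[j], points, dist)  # s_i vs P_j
        d_ji, p = nearest(reps[j], components[i], points, dist)  # s_j vs P_i
        if d_ij <= d_ji:
            candidates.append((d_ij, i, j, reps[i], q))
        else:
            candidates.append((d_ji, i, j, p, reps[j]))

    # Kruskal's algorithm on the coarsened graph, using union-find.
    root = list(range(len(components)))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    completion = []
    for w, i, j, u, v in sorted(candidates):
        ri, rj = find(i), find(j)
        if ri != rj:
            root[ri] = rj
            completion.append((u, v, w))
    return completion
```

In this sketch each pair of components costs $|P_i| + |P_j|$ distance evaluations, so the total over all pairs is $(t-1)\,n$, compared with the all-pairs scans a brute-force bichromatic closest pair computation would need for the optimal weights $w^*$. The returned edges, combined with the given spanning trees $\{T_i\}$, form a spanning tree of all points.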

Theorems & Definitions (10)

  • Theorem 3.1
  • proof
  • Lemma 4.1
  • proof
  • proof
  • proof
  • Theorem 4.2
  • proof
  • Theorem 4.3
  • proof