Approximate Tree Completion and Learning-Augmented Algorithms for Metric Minimum Spanning Trees
Nate Veldt, Thomas Stanley, Benjamin W. Priest, Trevor Steil, Keita Iwabuchi, T. S. Jayram, Geoffrey Sanders
TL;DR
The paper addresses computing minimum spanning trees in arbitrary metric spaces, where naive approaches incur $\Omega(n^2)$ distance queries. It introduces Metric Forest Completion (MFC), which starts from an initial forest and connects its components by solving a problem on a coarsened graph. The main algorithm, MFC-Approx, runs in subquadratic time, achieves an approximation factor below 2.62, and supports a learning-augmented guarantee that ties solution quality to the forest's overlap with an optimal MST. The key theoretical contribution is a bound of the form $w_{\mathcal{X}}(\hat{T}) \le \beta\, w^{*}(T^{*}_{\mathcal{P}}) + (\beta/(\beta-1)+1)\sum_i w_{\mathcal{X}}(T_i)$ with $\beta=(3+\sqrt{5})/2 \approx 2.618$, plus a refinement based on $\gamma$-overlap that yields a $(2\gamma+1)$-approximation in subquadratic time when the initial forest has $t = o(n)$ components. Empirically, the framework runs substantially faster than exact methods while producing near-optimal spanning trees across diverse metrics and data regimes, especially when clustering structure is present, which makes it practical for large-scale clustering and ML pipelines.
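As a quick check on the constants in the bound above, the coefficient on the forest term simplifies to $\beta$ itself. This is plain algebra from $\beta=(3+\sqrt{5})/2$; reading the right-hand side as $\beta$ times the weight of an optimal completion is an assumption based on the notation, not something the summary states.

$$
\beta = \frac{3+\sqrt{5}}{2} \;\Longrightarrow\; \beta^2 - 3\beta + 1 = 0 \;\Longrightarrow\; \beta(\beta-1) = 2\beta - 1,
$$

$$
\frac{\beta}{\beta-1} + 1 = \frac{2\beta-1}{\beta-1} = \frac{\beta(\beta-1)}{\beta-1} = \beta \approx 2.618,
$$

$$
w_{\mathcal{X}}(\hat{T}) \;\le\; \beta\Big( w^{*}(T^{*}_{\mathcal{P}}) + \sum_i w_{\mathcal{X}}(T_i) \Big).
$$

Under that reading, both terms carry the same factor, which matches the stated approximation factor below 2.62.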
Abstract
Finding a minimum spanning tree (MST) for $n$ points in an arbitrary metric space is a fundamental primitive for hierarchical clustering and many other ML tasks, but even approximating it takes $\Omega(n^2)$ time. We introduce a framework for metric MSTs that (1) finds a forest of disconnected components using practical heuristics, and then (2) finds a small-weight set of edges that connects the disjoint components of the forest into a spanning tree. We prove that optimally solving the second step still takes $\Omega(n^2)$ time, but we provide a subquadratic 2.62-approximation algorithm. In the spirit of learning-augmented algorithms, we then show that if the forest found in step (1) overlaps with an optimal MST, we can approximate the original MST problem in subquadratic time, where the approximation factor depends on a measure of overlap. In practice, we find nearly optimal spanning trees for a wide range of metrics, while being orders of magnitude faster than exact algorithms.
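To make step (2) concrete, here is a minimal Python sketch of one way to complete a forest with roughly $n \cdot t$ distance queries instead of $n^2$. It is an illustration under stated assumptions, not necessarily the authors' MFC-Approx: the function name `complete_forest_sketch`, the choice of the first index of each component as its representative, and the Euclidean default metric are all hypothetical. Each representative is compared against every point outside its own component, each pair of components gets the cheaper of the two representative-to-point edges, and Kruskal's algorithm is run on the resulting coarsened graph.

```python
import math
from itertools import combinations


def complete_forest_sketch(points, components, dist=math.dist):
    """Return edges (u, v, weight) that connect the given components.

    points:     list of coordinate tuples
    components: list of lists of point indices forming a partition
    dist:       metric on points (Euclidean by default; an assumption)
    """
    t = len(components)
    # Assumption: an arbitrary representative (the first index) per component.
    reps = [comp[0] for comp in components]

    # nearest[i][j] = (distance, index) of the point in component j that is
    # closest to the representative of component i. About n * t queries total.
    nearest = [[None] * t for _ in range(t)]
    for i, r in enumerate(reps):
        for j, comp in enumerate(components):
            if i != j:
                nearest[i][j] = min((dist(points[r], points[x]), x) for x in comp)

    # One candidate edge per pair of components: the cheaper of the two
    # representative-to-point edges found above.
    candidates = []
    for i, j in combinations(range(t), 2):
        d_ij, x_j = nearest[i][j]
        d_ji, x_i = nearest[j][i]
        if d_ij <= d_ji:
            candidates.append((d_ij, i, j, reps[i], x_j))
        else:
            candidates.append((d_ji, i, j, x_i, reps[j]))

    # Kruskal's algorithm on the coarsened graph, using union-find.
    parent = list(range(t))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    added = []
    for w, i, j, u, v in sorted(candidates):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            added.append((u, v, w))
    return added


# Three well-separated pairs of points on a line.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0), (21, 0)]
components = [[0, 1], [2, 3], [4, 5]]
added = complete_forest_sketch(points, components)
# added == [(1, 2, 9.0), (3, 4, 9.0)]
```

In this toy input the sketch recovers the two shortest inter-component edges. In general it only sees edges with a representative at one end, so the result is an approximation of the best completion; adding the returned edges to the initial forest gives a spanning tree of all points.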
