A Surprisingly Simple Method for Distributed Euclidean-Minimum Spanning Tree / Single Linkage Dendrogram Construction from High Dimensional Embeddings via Distance Decomposition

Richard Lettich

TL;DR

The paper targets exact Euclidean MST computation in high dimensions, where sub-quadratic MST algorithms are ineffective, with applications to single linkage dendrograms built from neural-network embeddings. It introduces a decomposition-based method that runs a dense-MST kernel (d-MST) on subgraphs and then combines the partial results by computing the MST of their union. Correctness follows from an optimal substructure property and a decomposition theorem, which together ensure that the union of the subgraph MSTs contains the global MST. In practice, the approach enables distributed parallelism with predictable bandwidth and work characteristics, offering a scalable route to geometric-MST and dendrogram construction in high-dimensional spaces.
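The decompose-then-merge idea summarized above can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: vertices are split into `k` round-robin blocks, the subgraphs are those induced by each pair of blocks (so every edge lies in at least one subgraph), the dense kernel is plain Prim's algorithm, and the merge step is Kruskal's algorithm. The function names `dense_mst`, `sparse_mst`, and `decomposed_mst` are hypothetical.

```python
import itertools
import math
import random


def dense_mst(points, idx):
    """Prim's algorithm on the complete graph over points[i], i in idx.

    Returns a set of edges (i, j) with i < j, in global indices.
    """
    idx = list(idx)
    if len(idx) < 2:
        return set()
    # best[i] = (cheapest known distance from i to the tree, tree endpoint)
    best = {i: (math.inf, None) for i in idx[1:]}
    current = idx[0]
    edges = set()
    while best:
        # Relax distances against the vertex most recently added to the tree.
        for i in best:
            d = math.dist(points[current], points[i])
            if d < best[i][0]:
                best[i] = (d, current)
        nxt = min(best, key=lambda i: best[i][0])
        _, parent = best.pop(nxt)
        edges.add((min(nxt, parent), max(nxt, parent)))
        current = nxt
    return edges


def sparse_mst(points, edges):
    """Kruskal's algorithm restricted to the given candidate edge set."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = set()
    by_weight = sorted(edges, key=lambda e: math.dist(points[e[0]], points[e[1]]))
    for i, j in by_weight:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.add((i, j))
    return tree


def decomposed_mst(points, k):
    """MST via pairwise block subproblems (assumed scheme; requires k >= 2)."""
    blocks = [list(range(b, len(points), k)) for b in range(k)]
    candidates = set()
    # Each pair of blocks is an independent task that could run on its own worker.
    for a, b in itertools.combinations(blocks, 2):
        candidates |= dense_mst(points, a + b)
    return sparse_mst(points, candidates)


if __name__ == "__main__":
    random.seed(0)
    pts = [[random.random() for _ in range(16)] for _ in range(40)]
    print(decomposed_mst(pts, 4) == dense_mst(pts, range(len(pts))))
```

In this sketch each pairwise subproblem touches only two blocks of vectors and returns a tree, so the candidate edge set handed to the final merge stays small compared with the full set of pairwise distances.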

Abstract

We introduce a decomposition method for the distributed calculation of exact Euclidean Minimum Spanning Trees in high dimensions (where sub-quadratic algorithms are not effective), or, more generally, geometric-minimum spanning trees of complete graphs, where each vertex $v\in V$ of the graph $G=(V,E)$ is represented by a vector $\vec{v}\in \mathbb{R}^n$, and the weight of any edge is given by a symmetric binary `distance' function between the representative vectors, $w(\{x,y\}) = d(\vec{x},\vec{y})$. This is motivated by the task of clustering high-dimensional embeddings produced by neural networks, where low-dimensional algorithms are ineffective; such geometric-minimum spanning trees are used as a subroutine in the construction of single linkage dendrograms, as the two structures can be converted into each other efficiently.
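To make the MST-to-dendrogram conversion concrete, here is a minimal Python sketch of one standard way to do it: process the tree edges in increasing weight order and merge clusters with a union-find structure. The function name `mst_to_single_linkage` and the output layout (merged cluster ids, merge height, merged size, with new clusters numbered from `n` upward) are assumptions made for illustration, not the paper's interface.

```python
def mst_to_single_linkage(n, weighted_edges):
    """Turn a spanning tree on vertices 0..n-1 into single linkage merges.

    weighted_edges: iterable of (weight, i, j) tuples forming a spanning tree.
    Returns a list of (cluster_a, cluster_b, height, size) tuples; the m-th
    merge (0-based) creates a new cluster with id n + m.
    """
    parent = list(range(n))
    size = [1] * n
    cluster_id = list(range(n))  # label of the cluster rooted at each root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    # Single linkage merges clusters in order of increasing MST edge weight.
    for w, i, j in sorted(weighted_edges):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # cannot happen for a true tree; guards against bad input
        if size[ri] < size[rj]:
            ri, rj = rj, ri  # union by size: attach the smaller root
        merges.append((cluster_id[ri], cluster_id[rj], w, size[ri] + size[rj]))
        parent[rj] = ri
        size[ri] += size[rj]
        cluster_id[ri] = n + len(merges) - 1
    return merges
```

For a tree on four vertices with edges `(1.0, 0, 1)`, `(2.0, 1, 2)`, `(4.0, 2, 3)`, the sketch merges vertices 0 and 1 at height 1.0, then that cluster with vertex 2 at height 2.0, and finally with vertex 3 at height 4.0. The cost is dominated by sorting the $|V|-1$ tree edges.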

Paper Structure

This paper contains 1 section, 2 theorems, 6 equations, 1 algorithm.

Table of Contents

  1. Algorithm

Key Result

Lemma 1

Let $G=(V,E)$ be a graph, $S\subseteq V$ be any subset of vertices, and $G[S]$ denote the subgraph induced by $S$. For all $G$ and $S$, the minimum spanning forest (MSF) obeys an optimal substructure property:
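The displayed formula for Lemma 1 did not survive in this excerpt. One plausible reading, stated here as an assumption rather than as the paper's wording, is that every MSF edge of $G$ with both endpoints in $S$ is also an MSF edge of $G[S]$. Under distinct edge weights this follows from the standard cycle property, as sketched below; the notation $\mathrm{MSF}(\cdot)$ for the edge set of the forest is introduced only for this sketch.

```latex
% Assumed reading of the optimal substructure property (not verbatim from the paper).
% MSF(H) is the edge set of the minimum spanning forest of H; edge weights are assumed distinct.
\[
  \mathrm{MSF}(G) \cap E(G[S]) \;\subseteq\; \mathrm{MSF}(G[S])
\]
% Sketch of why: let e \in E(G[S]) with e \notin \mathrm{MSF}(G[S]).
% By the cycle property, e is the heaviest edge on some cycle C of G[S].
% C is also a cycle of G, so e is the heaviest edge on a cycle of G,
% hence e \notin \mathrm{MSF}(G).
```

Read this way, the property is what allows per-subgraph MSTs to be merged safely: an edge discarded inside a subgraph can never belong to the global MSF.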

Theorems & Definitions (2)

  • Lemma 1
  • Theorem 1