A $(1+ε)$-Approximation for Ultrametric Embedding in Subquadratic Time

Gabriel Bathie, Guillaume Lagarde

TL;DR

The paper tackles the best ultrametric fit problem (BUF$_\infty$) for ultrametric embedding and introduces subquadratic algorithms that achieve arbitrarily precise approximations. It combines a $\gamma$-Kruskal tree, built via a breadth-first search guided by locality-sensitive hashing (LSH), with a dynamic approximate farthest-neighbor (AFN) data structure that yields an $\alpha$-approximation of the cut weights. Together these give a $c$-approximation of BUF$_\infty$ in $\tilde{O}(n^{1+1/c})$ time and, in particular, a $(1+\varepsilon)$-approximation in $\tilde{O}(n^{2-\varepsilon+o(\varepsilon^2)})$ time. The approach does not rely on subquadratic spanners; its subquadratic scaling in Euclidean spaces comes from LSH and the dynamic AFN structure. Experiments on real and synthetic data show improved approximation quality with competitive running times and scalability to large datasets.

Abstract

Efficiently computing accurate representations of high-dimensional data is essential for data analysis and unsupervised learning. Dendrograms, also known as ultrametrics, are widely used representations that preserve hierarchical relationships within the data. However, popular methods for computing them, such as linkage algorithms, suffer from quadratic time and space complexity, making them impractical for large datasets. The "best ultrametric embedding" (a.k.a. "best ultrametric fit") problem, which aims to find the ultrametric that best preserves the distances between points in the original data, is known to require at least quadratic time for an exact solution. Recent work has focused on improving scalability by approximating optimal solutions in subquadratic time, resulting in a $(\sqrt{2} + ε)$-approximation (Cohen-Addad, de Joannis de Verclos and Lagarde, 2021). In this paper, we present the first subquadratic algorithm that achieves arbitrarily precise approximations of the optimal ultrametric embedding. Specifically, we provide an algorithm that, for any $c \geq 1$, outputs a $c$-approximation of the best ultrametric in time $\tilde{O}(n^{1 + 1/c})$. In particular, for any fixed $ε > 0$, the algorithm computes a $(1+ε)$-approximation in time $\tilde{O}(n^{2 - ε + o(ε^2)})$. Experimental results show that our algorithm improves upon previous methods in terms of approximation quality while maintaining comparable running times.

Paper Structure

This paper contains 29 sections, 14 theorems, 12 equations, 4 figures, 4 tables, and 3 algorithms.

Key Result

Theorem 1

For any $\gamma \geq 1$ and $\alpha > 1$, there exists an algorithm that computes a $\gamma \cdot \alpha$-approximation of $\textsc{BUF}_\infty$ in time $\tilde{O}(n^{1 + 1/\gamma^2} + n^{1 + 1/\alpha^2})$ and space $\tilde{O}(n^{1 + 1/\gamma^2} + n^{1 + 1/\alpha^2})$.
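
As a worked check of how Theorem 1 relates to the $c$-approximation stated in the abstract, consider the balanced choice $\gamma = \alpha = \sqrt{c}$ for some $c > 1$. This choice is an assumption made here for illustration, not a parameter setting quoted from the paper. The approximation factor and the two terms of the running time then become

$$\gamma \cdot \alpha = c, \qquad n^{1 + 1/\gamma^2} + n^{1 + 1/\alpha^2} = 2\, n^{1 + 1/c}.$$

For $c = 1 + \varepsilon$, the exponent can be rewritten exactly as

$$1 + \frac{1}{1 + \varepsilon} = 2 - \varepsilon + \frac{\varepsilon^2}{1 + \varepsilon},$$

so it equals $2 - \varepsilon$ up to a second-order term in $\varepsilon$, which is strictly below $2$ for every $\varepsilon > 0$.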

Figures (4)

  • Figure 1: Points and clusters at three different levels of granularity (top) and the corresponding dendrogram (bottom).
  • Figure 2: Illustration of the connected components defined by an edge $e$. The cut weight of $e$ is the maximal distance between a point in $L(e)$ and a point in $R(e)$.
  • Figure 3: Approximation factor obtained by FastUlt for different values of $c$.
  • Figure 4: Accuracy of the $\alpha$-approximate cut weights algorithm. For each value of $\alpha$, the algorithm is run 30 times on each of the 5 datasets, resulting in 150 data points per boxplot.
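
To make the cut-weight definition from the Figure 2 caption concrete, here is a minimal brute-force sketch in Python. It is not the paper's subquadratic algorithm: it builds an exact minimum spanning tree and computes exact cut weights, whereas the paper uses a $\gamma$-Kruskal tree and approximate farthest-neighbor queries. The sketch assumes that $L(e)$ and $R(e)$ are the two components joined by $e$ when tree edges are added in increasing order of length (Kruskal's algorithm); the function name `kruskal_cut_weights` is hypothetical.

```python
import itertools
import math


def kruskal_cut_weights(points):
    """Return (u, v, edge_length, cut_weight) for each edge of an exact MST.

    Assumption (not taken from the paper): L(e) and R(e) are the two
    components that edge e merges when Kruskal's algorithm adds it.
    Brute force: quadratic time and space in the number of points.
    """
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    # All pairs, sorted by Euclidean distance (this is the quadratic part).
    edges = sorted(itertools.combinations(range(n), 2), key=lambda e: dist(*e))

    members = {i: [i] for i in range(n)}  # component label -> list of points
    label = list(range(n))                # point -> component label
    result = []

    for u, v in edges:
        a, b = label[u], label[v]
        if a == b:
            continue  # u and v are already connected: not a tree edge
        left, right = members[a], members[b]
        # Exact cut weight: farthest pair across the two components.
        cut = max(dist(x, y) for x in left for y in right)
        result.append((u, v, dist(u, v), cut))
        # Merge the smaller component into the larger one.
        if len(left) < len(right):
            a, b, left, right = b, a, right, left
        for y in right:
            label[y] = a
        left.extend(right)
        del members[b]

    return result


if __name__ == "__main__":
    for u, v, length, cut in kruskal_cut_weights([(0.0,), (1.0,), (3.0,)]):
        print(f"edge ({u}, {v}): length={length}, cut weight={cut}")
```

On three collinear points at 0, 1 and 3, the second tree edge joins the points at 1 and 3 (length 2), but its cut weight is 3 because the farthest pair across the cut is the pair at 0 and 3. The inner `max` over all cross pairs is the step a dynamic approximate farthest-neighbor structure would replace in a subquadratic variant.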

Theorems & Definitions (28)

  • Theorem 1
  • Definition 2
  • Definition 3
  • Theorem 4: CKL20
  • Theorem 5
  • Theorem 6
  • Corollary 7
  • proof
  • Definition 8
  • Definition 9
  • ...and 18 more