Fitting Tree Metrics and Ultrametrics in Data Streams

Amir Carmel, Debarati Das, Evangelos Kipouridis, Evangelos Pipis

TL;DR

This work studies fitting tree metrics and ultrametrics to pairwise distance data in the semi-streaming model, where the entries of the distance matrix arrive one at a time in arbitrary order. It gives single-pass, polynomial-time, $\tilde{O}(n)$-space approximation algorithms for the $\ell_0$ and $\ell_\infty$ ultrametric problems, derives an approximation for the $\ell_1$ objective from the $\ell_0$ algorithm, and proves lower bounds (tight for $\ell_\infty$) showing that approximation is unavoidable in a single pass. It further extends the results to tree metrics by building on the ultrametric algorithms, using one additional pass and without asymptotically increasing the approximation factor. Together, these contributions enable scalable hierarchical clustering on large, streaming datasets while clarifying the trade-offs between passes, space, and approximation quality.

Abstract

Fitting distances to tree metrics and ultrametrics are two widely used methods in hierarchical clustering, primarily explored within the context of numerical taxonomy. Given a positive distance function $D:\binom{V}{2}\rightarrow\mathbb{R}_{>0}$, the goal is to find a tree (or ultrametric) $T$ including all elements of set $V$ such that the difference between the distances among vertices in $T$ and those specified by $D$ is minimized. In this paper, we initiate the study of ultrametric and tree metric fitting problems in the semi-streaming model, where the distances between pairs of elements from $V$ (with $|V|=n$), defined by the function $D$, can arrive in an arbitrary order. We study these problems under various distance norms: For the $\ell_0$ objective, we provide a single-pass polynomial-time $\tilde{O}(n)$-space $O(1)$ approximation algorithm for ultrametrics and prove that no single-pass exact algorithm exists, even with exponential time. Next, we show that the algorithm for $\ell_0$ implies an $O(\Delta/\delta)$ approximation for the $\ell_1$ objective, where $\Delta$ is the maximum and $\delta$ is the minimum absolute difference between distances in the input. This bound matches the best-known approximation for the RAM model using a combinatorial algorithm when $\Delta/\delta=O(n)$. For the $\ell_\infty$ objective, we provide a complete characterization of the ultrametric fitting problem. We present a single-pass polynomial-time $\tilde{O}(n)$-space 2-approximation algorithm and show that no better than 2-approximation is possible, even with exponential time. We also show that, with an additional pass, it is possible to achieve a polynomial-time exact algorithm for ultrametrics. Finally, we extend the results for all these objectives to tree metrics by using only one additional pass through the stream and without asymptotically increasing the approximation factor.
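To make the three objectives concrete, the following minimal sketch evaluates the $\ell_0$, $\ell_1$, and $\ell_\infty$ fitting errors of a candidate $T$ against an input $D$, and checks the standard three-point condition for ultrametrics. It is only an illustration of the problem definition, not the paper's streaming algorithm: it holds all distances in memory, and the names `fitting_errors` and `is_ultrametric` and the dictionary representation are assumptions made for this example.

```python
from itertools import combinations


def fitting_errors(D, T):
    """Return (l0, l1, linf) errors between input D and candidate T.

    D and T are dicts mapping frozenset({u, v}) to a distance
    (a hypothetical representation chosen for this sketch).
    """
    diffs = [abs(D[pair] - T[pair]) for pair in D]
    l0 = sum(1 for d in diffs if d > 0)   # number of pairs that disagree
    l1 = sum(diffs)                       # total absolute disagreement
    linf = max(diffs, default=0)          # largest single disagreement
    return l0, l1, linf


def is_ultrametric(T, V):
    """Three-point condition: in every triple, the two largest distances are equal."""
    for u, v, w in combinations(V, 3):
        sides = sorted([T[frozenset((u, v))],
                        T[frozenset((v, w))],
                        T[frozenset((u, w))]])
        if sides[1] != sides[2]:
            return False
    return True


V = ["a", "b", "c"]
D = {frozenset(("a", "b")): 1, frozenset(("b", "c")): 2, frozenset(("a", "c")): 4}
T = {frozenset(("a", "b")): 1, frozenset(("b", "c")): 3, frozenset(("a", "c")): 3}

print(is_ultrametric(D, V))   # D violates the three-point condition
print(is_ultrametric(T, V))   # T satisfies it
print(fitting_errors(D, T))   # (l0, l1, linf) of T with respect to D
```

In this toy instance, $T$ changes two of the three input distances by 1 each, so the three norms rank candidate fits differently: $\ell_0$ counts changed pairs, $\ell_1$ sums the changes, and $\ell_\infty$ looks only at the worst one.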

Paper Structure

This paper contains 29 sections, 44 theorems, 28 equations, 2 algorithms.

Key Result

Theorem 1

There exists a single-pass polynomial-time semi-streaming algorithm that, with high probability (w.h.p.), $O(1)$-approximates the $\ell_0$ Best-Fit Ultrametrics problem.

Theorems & Definitions (94)

  • Theorem 1
  • Corollary 2
  • proof
  • Theorem 3
  • Corollary 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Corollary 8
  • ...and 84 more