Fitting trees to $\ell_1$-hyperbolic distances

Joon-Hyeok Yim, Anna C. Gilbert

TL;DR

This work investigates how to fit tree and ultrametric structures to given finite distance sets by leveraging a vectorized notion of hyperbolicity. It introduces AvgHyp and AvgUM as refined, scalable proxies for tree-likeness and proves that the optimal $\ell_p$ distortion of a tree or ultrametric fit can be bounded in terms of these averages, with tight constants. The authors present HCCTriangle-based hierarchical correlation clustering to obtain efficient ultrametric fits (HCCUltraFit) and a rooted-tree fit (HCCRootedTreeFit) with $O(n^2\log n)$ and $O(n^3\log n)$ time, respectively, and demonstrate that real-world hierarchical data sets often differ markedly from synthetic tree-like data in their distortion characteristics. The results suggest that more nuanced geometric notions are needed for learning and analysis tasks on practical data sets, where simple tree representations may be insufficient or misleading.

Abstract

Building trees to represent or to fit distances is a critical component of phylogenetic analysis, metric embeddings, approximation algorithms, geometric graph neural nets, and the analysis of hierarchical data. Much of the previous algorithmic work, however, has focused on generic metric spaces (i.e., those with no a priori constraints). Leveraging several ideas from the mathematical analysis of hyperbolic geometry and geometric group theory, we study the tree fitting problem as finding the relation between the hyperbolicity (ultrametricity) vector and the error of tree (ultrametric) embedding. That is, we define a vector of hyperbolicity (ultrametricity) values over all triples of points and compare the $\ell_p$ norms of this vector with the $\ell_q$ norm of the distortion of the best tree fit to the distances. This formulation allows us to define the average hyperbolicity (ultrametricity) in terms of a normalized $\ell_1$ norm of the hyperbolicity vector. Furthermore, we can interpret the classical tree fitting result of Gromov as a $p = q = \infty$ result. We present an algorithm HCCRootedTreeFit such that the $\ell_1$ error of the output embedding is analytically bounded in terms of the $\ell_1$ norm of the hyperbolicity vector (i.e., $p = q = 1$) and that this result is tight. Furthermore, this algorithm has significantly different theoretical and empirical performance as compared to Gromov's result and related algorithms. Finally, we show, using HCCRootedTreeFit and related tree fitting algorithms, that supposedly standard data sets for hierarchical data analysis and geometric graph neural networks have radically different tree fits than those of synthetic, truly tree-like data sets, suggesting that a much more refined analysis of these standard data sets is called for.
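To make the idea of an averaged (normalized $\ell_1$) hyperbolicity or ultrametricity concrete, the following is a minimal Python sketch. It assumes the standard four-point and three-point conditions as the per-tuple violation measures: for a quadruple, half the gap between the two largest of the three pairwise distance sums; for a triple, the gap between the two largest pairwise distances. The function names, the factor of 1/2, and the choice to average over all unordered tuples are assumptions made for illustration; the paper's AvgHyp and AvgUM may be scaled or indexed differently (for example, with a fixed base point).

```python
from itertools import combinations


def ultrametricity(d, x, y, z):
    """Three-point violation: gap between the two largest pairwise distances.

    Zero exactly when the triple satisfies the ultrametric (isosceles) condition.
    """
    a, b, c = sorted((d[x][y], d[x][z], d[y][z]))
    return c - b


def four_point(d, x, y, z, w):
    """Four-point violation: half the gap between the two largest distance sums.

    Zero exactly when the quadruple embeds isometrically into a tree.
    The 1/2 factor is one common convention (an assumption here).
    """
    s = sorted((d[x][y] + d[z][w], d[x][z] + d[y][w], d[x][w] + d[y][z]))
    return (s[2] - s[1]) / 2.0


def _mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def avg_ultrametricity(d):
    """Mean three-point violation over all unordered triples (hypothetical AvgUM)."""
    return _mean(ultrametricity(d, *t) for t in combinations(range(len(d)), 3))


def avg_hyperbolicity(d):
    """Mean four-point violation over all unordered quadruples (hypothetical AvgHyp)."""
    return _mean(four_point(d, *q) for q in combinations(range(len(d)), 4))


# Path metric on 4 points: a tree metric, but not an ultrametric.
path = [[abs(i - j) for j in range(4)] for i in range(4)]
# Shortest-path metric on a 4-cycle: not a tree metric.
cycle = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
```

On the path metric the averaged four-point violation is zero while the averaged three-point violation is not, which illustrates that tree-likeness and ultrametricity are distinct conditions. Both averages enumerate all tuples, so this brute-force sketch costs $O(n^4)$ and $O(n^3)$ time respectively and is only suitable for small inputs.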

Paper Structure

This paper contains 35 sections, 14 theorems, 56 equations, 5 figures, 6 tables, 9 algorithms.

Key Result

Proposition 2.1

We have the simple relations:

Figures (5)

  • Figure 1: Illustration of the highly connectedness condition.
  • Figure 2: This figure depicts how RestrictTree works. The idea is that we can relocate every vertex so that $d(x,y_0) = d_T(x,y_0)$ holds for every $x \in X$.
  • Figure 3: This figure depicts the example that proves Gromov's distortion bound is asymptotically tight. By symmetry, we can conclude that Gromov's algorithm will always return the same $\ell_\infty$ error regardless of the choice of base point $w$.
  • Figure 4: This figure depicts the example output $d_U$ by drawing a dendrogram.
  • Figure 5: This figure depicts what the output $d_T$ looks like. This tree structure can in fact be obtained easily from the structure of the dendrogram we computed.

Theorems & Definitions (23)

  • Proposition 2.1
  • Definition 2.2: $\mathbf{\ell_p/\ell_q}$ tree (ultrametric) fitting problem
  • Theorem 2.3
  • Theorem 2.4
  • Theorem 2.5
  • Theorem 2.6
  • Definition 3.1: HCC with triangle objectives
  • Proposition 3.2
  • Theorem 3.3
  • Theorem 3.4
  • ...and 13 more