Faster Weighted and Unweighted Tree Edit Distance and APSP Equivalence

Jakob Nogler, Adam Polak, Barna Saha, Virginia Vassilevska Williams, Yinzhan Xu, Christopher Ye

TL;DR

The paper resolves the long-standing question of whether tree edit distance (TED) is fine-grained equivalent to All-Pairs Shortest Paths (APSP) by presenting a tight reduction from TED (and its forest variants) to Min-Plus matrix multiplication. Combined with Williams’ APSP algorithm, this yields a subcubic running time for TED, namely $n^3/2^{\Omega(\sqrt{\log n})}$. It introduces a unified alignment-graph framework that maps TED to border-to-border distances in forest alignment graphs and leverages structured min-plus products (MonMUL) to obtain $\tilde{O}(n^{(3+\omega)/2})$ time for unweighted TED. Central to the approach are Spine Edit Distance (SED) and Forest Edit Distance (FED) as intermediate problems, with tight reductions that connect SED/FED to APSP, including unbalanced instances, and comprehensive divide-and-conquer strategies (including DISED/UDISED) that preserve subquadratic factors. The results unify TED with APSP in the fine-grained landscape and deliver the fastest known algorithm for unweighted TED, while also clarifying the distinct computational nature of unweighted vs. weighted TED. Collectively, the work advances the understanding of TED’s complexity and provides subcubic algorithms grounded in state-of-the-art min-plus product techniques, with implications for related string/structure similarity problems.

Abstract

The tree edit distance (TED) between two rooted ordered trees with $n$ nodes labeled from an alphabet $\Sigma$ is the minimum cost of transforming one tree into the other by a sequence of valid operations consisting of insertions, deletions, and relabelings of nodes. The tree edit distance is a well-known generalization of string edit distance and has been studied since the 1970s. Years of steady improvements have led to an $O(n^3)$ algorithm [DMRW 2010]. Fine-grained complexity sheds light on the hardness of TED, showing that a truly subcubic time algorithm for TED implies a truly subcubic time algorithm for All-Pairs Shortest Paths (APSP) [BGMW 2020]. Therefore, under the popular APSP hypothesis, a truly subcubic time algorithm for TED cannot exist. However, unlike many problems in fine-grained complexity for which conditional hardness based on APSP also comes with equivalence to APSP, whether TED can be reduced to APSP has remained unknown. In this paper, we resolve this question. Not only do we show that TED is fine-grained equivalent to APSP, but our reduction is also tight enough that, combined with the fastest APSP algorithm to date [Williams 2018], it gives the first subcubic time algorithm for TED, running in $n^3/2^{\Omega(\sqrt{\log n})}$ time. We also consider the unweighted tree edit distance problem, in which the cost of each edit is one. For unweighted TED, a truly subcubic algorithm is known due to Mao [Mao 2022], later improved slightly by Dürr [Dürr 2023] to run in $O(n^{2.9148})$ time. Their algorithms use bounded monotone min-plus product as a crucial subroutine, and the best running time for this product is $\tilde{O}(n^{\frac{3+\omega}{2}}) \leq O(n^{2.6857})$ (where $\omega$ is the exponent of fast matrix multiplication). In this work, we close this gap and give an algorithm for unweighted TED that runs in $\tilde{O}(n^{\frac{3+\omega}{2}})$ time.
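
The min-plus product named in the abstract, and its link to APSP, can be illustrated with a minimal Python sketch. This is the textbook cubic-time definition, not the paper's reduction or Williams' algorithm; the function names `min_plus_product` and `apsp_by_repeated_squaring` are chosen here for illustration, and the APSP routine assumes a weight matrix with no negative cycles.

```python
from math import inf


def min_plus_product(A, B):
    """Naive (min,+) product: C[i][j] = min over k of A[i][k] + B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[inf] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            if a == inf:
                continue  # no path through k from i
            for j in range(p):
                cand = a + B[k][j]
                if cand < C[i][j]:
                    C[i][j] = cand
    return C


def apsp_by_repeated_squaring(W):
    """APSP via repeated min-plus squaring of the weight matrix W.

    Assumes W[i][j] is the edge weight (inf if absent) and that the
    graph has no negative cycles.
    """
    n = len(W)
    D = [row[:] for row in W]
    for i in range(n):
        D[i][i] = min(D[i][i], 0)  # empty path from i to i
    hops = 1  # D currently covers paths with at most `hops` edges
    while hops < n - 1:
        D = min_plus_product(D, D)
        hops *= 2
    return D


W = [[0, 3, inf],
     [inf, 0, 1],
     [2, inf, 0]]
D = apsp_by_repeated_squaring(W)
```

Each squaring doubles the number of edges a path may use, so about $\log n$ products suffice; this is the standard reason a faster min-plus product gives a faster APSP algorithm.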

Paper Structure

This paper contains 33 sections, 55 theorems, 22 equations, 8 figures, 1 table.

Key Result

corollary 1

Let $\mathbf{F}, \mathbf{F}'$ be forests. Then, there is an algorithm for TED running in time $n^3/2^{\Omega(\sqrt{\log n})}$, where $n = \max(|\mathbf{F}|,|\mathbf{F}'|)$.
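
The bound in Corollary 1 is subcubic but not truly subcubic, which is why it does not contradict the APSP hypothesis mentioned in the abstract. A short derivation follows, with logarithms taken base 2 and the constant hidden in $\Omega(\cdot)$ written as $c > 0$ (a symbol introduced here only for illustration):

```latex
\[
2^{c\sqrt{\log n}} = n^{c/\sqrt{\log n}} = n^{o(1)}
\quad\Longrightarrow\quad
\frac{n^3}{2^{c\sqrt{\log n}}} = n^{3 - o(1)} .
\]
\[
\text{For every fixed } \varepsilon > 0:\qquad
\frac{n^3 / 2^{c\sqrt{\log n}}}{n^{3-\varepsilon}}
  = 2^{\varepsilon \log n - c\sqrt{\log n}} \to \infty ,
\qquad
\frac{n^3 / 2^{c\sqrt{\log n}}}{n^{3}}
  = 2^{-c\sqrt{\log n}} \to 0 .
\]
```

So the running time is asymptotically smaller than $n^3$, yet larger than $n^{3-\varepsilon}$ for every fixed $\varepsilon > 0$.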

Figures (8)

  • Figure 1: Overlaying the alignment graphs for string edit distance for $(L, L')$ and $(R, R')$ leads to an intuitive visualization of TED on caterpillar trees under the assumption that central nodes are only mapped to central nodes, left children only to left ones, and right children only to right ones (the caterpillar mapping assumption). The problem can be visualized as two paths in the two graphs, which, whenever they intersect, allow mapping of central nodes to central nodes.
  • Figure 2: The figure illustrates an instance of the recursive scheme used to solve TED on caterpillars. The inputs can be visualized as fixing the starting points of the two paths on the upper right border (in orange) of the rectangle, and we are determining the maximum value achievable from there onwards. We are required to compute the same values for the lower left border (in purple). The divide-et-impera scheme divides the rectangle vertically into two smaller ones, parameterized by the corners $(a,a'), (r,b')$ and $(r,a'), (b,b')$. The figure also demonstrates how to compute the inputs for the scheme on the left subrectangle for one specific case: one path leaves the upper border, and the other exits from the right border.
  • Figure 3: The general case for similarity mappings on caterpillar trees.
  • Figure 4: To compute $\mathsf{sim}(\mathbf{F}, \mathbf{F}')$ for the two depicted trees $\mathbf{T}$ and $\mathbf{T}'$ under the spine mapping assumption, we can use the same visualization approach as before. Specifically, we overlay two tree alignment graphs corresponding to the concatenated left and right subtrees, tracing two paths that, whenever they intersect at two spine nodes, provide the possibility to map them. These graphs correspond to $L_1 L_2 \cdots L_m$ vs. $L_1' L_2' \cdots L'_{m'}$ and $\mathsf{rev}(R_1 R_2 \cdots R_m)$ vs. $\mathsf{rev}(R_1' R_2' \cdots R'_{m'})$, shown in blue and teal, respectively. Given the similarity values $\mathsf{sim}(\mathsf{sub}(v), \mathsf{sub}(v'))$ for all $(v, v') \in (\mathbf{F} \times \mathbf{F}') \setminus (\mathbf{S} \times \mathbf{S}')$, we can construct these trees. In Spine Edit Distance (SED), instead of computing only $\mathsf{sim}(\mathbf{F}, \mathbf{F}')$, we also need to determine $\mathsf{sim}(\mathsf{sub}(s), \mathsf{sub}(s'))$ for all $(s, s') \in (\mathbf{S} \times \mathbf{S}')$. Fortunately, by employing a similar divide-and-conquer approach as used for caterpillars, computing $\mathsf{sim}(\mathbf{F}, \mathbf{F}')$ naturally leads to obtaining these values as well. When designing such a divide-and-conquer scheme, the subproblems must be indexed by rectangles whose edges align with coordinates corresponding to spine nodes. This ensures that no diagonal edges in the tree alignment graphs "jump over" the sides of the rectangle, allowing for a "clean" partitioning into subproblems. As with caterpillar trees, removing the spine mapping assumption would introduce a third path between the two existing ones.
  • Figure 5: The alignment whose value is computed in the corresponding equation for the unweighted case, visualized as a Border-to-Border (BBD) distance computation. The value of the orange path corresponds to the first summand, or the alignment between $\mathbf{F}\bm{[}\,\mathsf{l}(s_{i})\,\bm{.\,.}\,\mathsf{l}(s_{i + 1})\,\bm{)}$ and $\mathbf{F}'\bm{[}\,x'\,\bm{.\,.}\,w'\,\bm{)}$. The value of the blue path corresponds to the last summand, or the alignment between $\mathbf{F}\bm{[}\,\mathsf{r}(s_{i + 1})\,\bm{.\,.}\,\mathsf{r}(s_{i})\,\bm{)}$ and $\mathbf{F}'\bm{[}\,z'\,\bm{.\,.}\,y'\,\bm{)}$. For every pair of points on the right border, we have the optimal alignment between $\mathsf{sub}(s_{i + 1})$ and $\mathbf{F}'\bm{[}\,w'\,\bm{.\,.}\,z'\,\bm{)}$.
  • ...and 3 more figures
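
Several captions above refer to alignment graphs for string edit distance. As a reminder of that standard construction (not the paper's tree or forest alignment graphs), here is a minimal Python sketch in which edit distance is the shortest path from corner $(0,0)$ to corner $(n,m)$ of a grid graph. Unit edit costs and the function name `alignment_graph_distance` are assumptions made here for illustration; the captions phrase the problem in terms of similarity, whereas this sketch minimizes cost.

```python
from math import inf


def alignment_graph_distance(s, t):
    """Shortest path from (0, 0) to (len(s), len(t)) in the alignment graph.

    Edges (unit costs assumed):
      (i-1, j)   -> (i, j): delete s[i-1],                cost 1
      (i, j-1)   -> (i, j): insert t[j-1],                cost 1
      (i-1, j-1) -> (i, j): match or relabel s[i-1]/t[j-1], cost 0 or 1
    The graph is a DAG, so one pass in row-major order suffices.
    """
    n, m = len(s), len(t)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue  # source vertex, distance 0
            best = inf
            if i > 0:
                best = min(best, dist[i - 1][j] + 1)
            if j > 0:
                best = min(best, dist[i][j - 1] + 1)
            if i > 0 and j > 0:
                relabel = 0 if s[i - 1] == t[j - 1] else 1
                best = min(best, dist[i - 1][j - 1] + relabel)
            dist[i][j] = best
    return dist[n][m]


example = alignment_graph_distance("kitten", "sitting")
```

Viewing the dynamic program as a path problem is what makes it natural to ask for distances between all pairs of border vertices of a rectangle, as in the Border-to-Border computation of Figure 5.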

Theorems & Definitions (68)

  • corollary 1
  • corollary 2
  • lemma 2
  • theorem 3
  • theorem 4
  • definition 5
  • definition 6: M22
  • definition 7: M22
  • definition 8: M22
  • definition 9: Anchors and anchor sets
  • ...and 58 more