The Path-Label Reconciliation (PLR) Dissimilarity Measure for Gene Trees

Alitzel López Sánchez, José Antonio Ramírez-Rafael, Alejandro Flores-Lamas, Maribel Hernández-Rosales, Manuel Lafond

TL;DR

PLR is a linear-time dissimilarity measure for comparing gene trees reconciled with the same species tree. It combines a species-map component $d_{path}$ and an event-label component $d_{lbl}$, balanced by a weight $\alpha \in [0,1]$, into a single value $d_{plr}$. Each node $v$ of $G_1$ is matched to $m(v) = \mathrm{lca}_{G_2}(L(G_1(v)))$, the lowest common ancestor in $G_2$ of the leaves below $v$, so that the measure accounts for both topology and event labels. Empirically, PLR yields a more evenly distributed range of distances and is less prone than RF-based metrics to overestimating differences caused by small topological changes, while remaining computationally efficient. The work also derives diameter bounds, discusses equivalence under least-duplication-resolved reductions, and outlines future directions including triangle-inequality properties for binary trees and integration with existing bioinformatics tools.

Abstract

In this study, we investigate the problem of comparing gene trees reconciled with the same species tree using a novel semi-metric, called the Path-Label Reconciliation (PLR) dissimilarity measure. This approach not only quantifies differences in the topology of reconciled gene trees, but also considers discrepancies in predicted ancestral gene-species maps and speciation/duplication events, offering a refinement of existing metrics such as Robinson-Foulds (RF) and their labeled extensions LRF and ELRF. A tunable parameter α also allows users to adjust the balance between its species map and event labeling components. We show that PLR can be computed in linear time and that it is a semi-metric. We also discuss the diameters of reconciled gene tree measures, which are important in practice for normalization, and provide initial bounds on PLR, LRF, and ELRF. To validate PLR, we simulate reconciliations and perform comparisons with LRF and ELRF. The results show that PLR provides a more evenly distributed range of distances, making it less susceptible to overestimating differences in the presence of small topological changes, while at the same time being computationally efficient. Our findings suggest that the theoretical diameter is rarely reached in practice. The PLR measure advances phylogenetic reconciliation by combining theoretical rigor with practical applicability. Future research will refine its mathematical properties, explore its performance on different tree types, and integrate it with existing bioinformatics tools for large-scale evolutionary analyses. The open source code is available at: https://pypi.org/project/parle/.
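
To make the ingredients above concrete, here is a minimal Python sketch of a PLR-style computation. It is not the authors' implementation and does not use the `parle` package. Several things are assumptions made for illustration: the dictionary-based tree encoding, the function names, the toy trees, and the per-node weighting $\alpha \cdot \text{path} + (1-\alpha) \cdot \text{label}$ summed over internal nodes in both directions (the text here only states that a weight $\alpha \in [0,1]$ balances the two components). The lca search is deliberately naive and quadratic, not the linear-time algorithm the paper describes, and it assumes the gene trees have no unary nodes.

```python
def leaf_sets(children, root):
    """Map each node to the frozenset of leaves in its subtree."""
    out = {}

    def visit(v):
        kids = children.get(v, [])
        out[v] = frozenset([v]) if not kids else frozenset().union(*[visit(c) for c in kids])
        return out[v]

    visit(root)
    return out


def lca_map(leaves_1, leaves_2):
    """m(v): the lowest node of tree 2 whose leaf set contains L(G1(v)). Naive search."""
    return {
        v: min((w for w, lw in leaves_2.items() if lv <= lw),
               key=lambda w: len(leaves_2[w]))
        for v, lv in leaves_1.items()
    }


def path_dist(parent, a, b):
    """Number of edges between species a and b (both assumed to be in the same tree)."""
    up_from_a, x, d = {}, a, 0
    while x is not None:          # record distance from a to each of its ancestors
        up_from_a[x] = d
        x, d = parent.get(x), d + 1
    x, d = b, 0
    while x not in up_from_a:     # climb from b until reaching an ancestor of a
        x, d = parent.get(x), d + 1
    return d + up_from_a[x]


def one_sided(t1, t2, s_parent, alpha):
    """Sum of weighted path and label mismatches over internal nodes of t1 (assumed form)."""
    m = lca_map(leaf_sets(t1["children"], t1["root"]),
                leaf_sets(t2["children"], t2["root"]))
    total = 0.0
    for v, kids in t1["children"].items():
        if not kids:
            continue
        w = m[v]
        path = path_dist(s_parent, t1["mu"][v], t2["mu"][w])
        lbl = int(t1["label"][v] != t2["label"][w])
        total += alpha * path + (1 - alpha) * lbl
    return total


def plr(t1, t2, s_parent, alpha=0.5):
    """Symmetric sum of both directions (assumed form of d_plr)."""
    return one_sided(t1, t2, s_parent, alpha) + one_sided(t2, t1, s_parent, alpha)


# Hypothetical toy data: species tree ((A,B)Z1,C)Z0 given as child -> parent.
S_PARENT = {"A": "Z1", "B": "Z1", "Z1": "Z0", "C": "Z0"}
G1 = {"root": "x0", "children": {"x0": ["x1", "c"], "x1": ["a", "b"]},
      "mu": {"x0": "Z0", "x1": "Z1"}, "label": {"x0": "spec", "x1": "spec"}}
G2 = {"root": "y0", "children": {"y0": ["y1", "b"], "y1": ["a", "c"]},
      "mu": {"y0": "Z0", "y1": "Z0"}, "label": {"y0": "dup", "y1": "spec"}}

print(plr(G1, G2, S_PARENT, alpha=0.5))  # 2.0
```

In this toy case, `x1` is matched to `y0`, which lies one species-tree edge away and carries a different label, so it contributes to both components. Raising `alpha` shifts weight towards the species-map term and away from the label term.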

Paper Structure (10 sections, 9 theorems, 14 equations, 7 figures, 2 algorithms)

Key Result

Lemma 1

Let $\mathcal{G} = (G, S, \mu, l)$ be a reconciled gene tree that is least duplication-resolved. Let $u,v \in V(G)$ be such that $v \prec_G u$. Then either $\mu(u) \neq \mu(v)$ or $l(u) \neq l(v)$.
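
The conclusion of Lemma 1 can be checked mechanically on a given tree. The sketch below is only an illustration under stated assumptions: it reads $v \prec_G u$ as "$v$ is a strict descendant of $u$", it inspects internal nodes only, and the tree encoding, function name, and toy trees are hypothetical. It does not construct the least duplication-resolved form; it only reports whether some ancestor-descendant pair shares both its species map and its label.

```python
def find_shared_pair(children, root, mu, label):
    """Return an (ancestor, descendant) pair of internal nodes with equal mu and
    equal label, or None if no such pair exists."""
    def visit(v, seen):
        kids = children.get(v, [])
        if not kids:                      # leaves are skipped in this sketch
            return None
        key = (mu[v], label[v])
        if key in seen:                   # an ancestor already has this (mu, label)
            return (seen[key], v)
        seen[key] = v
        for c in kids:
            hit = visit(c, seen)
            if hit is not None:
                return hit
        del seen[key]                     # leave the path before backtracking
        return None

    return visit(root, {})


# Hypothetical example: two stacked duplications in species A form a redundant edge.
unresolved = {"r": ["u", "a3"], "u": ["a1", "a2"]}
mu_unres = {"r": "A", "u": "A"}
lbl_unres = {"r": "dup", "u": "dup"}

# Contracting that edge leaves a single duplication node with three children.
resolved = {"r": ["a1", "a2", "a3"]}
mu_res = {"r": "A"}
lbl_res = {"r": "dup"}

print(find_shared_pair(unresolved, "r", mu_unres, lbl_unres))  # ('r', 'u')
print(find_shared_pair(resolved, "r", mu_res, lbl_res))        # None
```

A `None` result is consistent with the lemma's conclusion for that tree. A returned pair shows the tree cannot be least duplication-resolved under the reading assumed here.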

Figures (7)

  • Figure 1: In the upper row, there are two reconciled gene trees $G_1$ and $G_2$ as well as a species tree $S$. The event labelings are shown as red circles and blue squares, which represent speciations and duplications, respectively. Lowercase letters $a,b,c,d$ depict extant genes, while the corresponding uppercase letters are the species where genes reside. The maps $\mu_1, \mu_2$ use the lca-mapping, that is, $\mu_1(x_0) = z_0, \mu_1(x_1) = z_1, \mu_1(x_2) = z_2$, and $\mu_2(y_0) = \mu_2(y_1) = z_0, \mu_2(y_2) = z_2$. The gene trees have the same set of leaves but different topology and event labeling. Purple arrows exemplify the maps $m_{\mathcal{G}_1, \mathcal{G}_2}(x_1)$, which is the lca of genes $c$ and $d$, and $m_{\mathcal{G}_2, \mathcal{G}_1}(y_0)$, while green arrows illustrate the species map $\mu_2$. The shaded edge in $S$ displays the path distance between $\mu_1(x_1) = z_1$ and $\mu_2(m(x_1)) = \mu_2(y_0) = z_0$. The lower row shows the explicit evolution of the gene trees within the species tree. The contribution of $x_1$ to the $d_{path}$ component is $1$, because $dist_S( \mu_1(x_1), \mu_2( m(x_1)) ) = 1$, whereas its contribution to $d_{lbl}$ is $0$ because $l(x_1) = l(m(x_1)) = dup$. On the other hand, the node $y_0$ from $G_2$ contributes $0$ to $d_{path}$ since its correspondent $x_0$ is mapped to the same species, but contributes $1$ to $d_{lbl}$ since $l(y_0) = dup$ and $l(x_0) = spec$.
  • Figure 2: Two different reconciled gene trees $\mathcal{G}_1, \mathcal{G}_2$, where redundant edges are bold (again, lowercase letters indicate the species). Their $d_{plr}$ value is $0$: one can check that every duplication in a species $W \in \{A, X, B\}$ in either tree maps to a duplication in the same $W$ in the other tree, and the $X$ speciation maps to an $X$ speciation. On the right are the least duplication-resolved versions of the trees, showing that $\mathcal{G}_1 \simeq_d \mathcal{G}_2$.
  • Figure 3: A species tree $S$ and reconciled gene trees $\mathcal{G}_1, \mathcal{G}_2, \mathcal{G}_3$ that violate the triangle inequality.
  • Figure 4: An example of two labeled trees (left and right), with $n = 5$ leaves and two internal edges, which both need to be contracted. To achieve this under the ELRF distance, we can perform $\lfloor (n - 2)/2 \rfloor = 1$ relabeling to make every label a circle (not shown), then contract every internal edge to obtain a star tree (second drawing). We can then change the remaining label, and reverse the operations to obtain the right tree. This takes $7 = 3n - 8$ operations.
  • Figure 5: Distributions of the PLR, ELRF, LRF, and RF metrics for datasets $\Gamma_{10,20}$, $\Gamma_{25,10}$, and $\Gamma_{50,5}$, from top to bottom rows, respectively, and $\alpha$ values from the set $\{\frac{1}{n}, 0.25, 0.5, 0.75\}$, where $n$ is the number of species. Each row corresponds to a dataset, while each column represents a different value of $\alpha$. The $x$-axis represents max-normalized values ranging from $0$ to $1$, and the $y$-axis is the frequency of these values. The PLR measure in purple shows a centered and symmetric distribution with a broader spread. The ELRF, LRF, and RF metrics, shown in light orange, green, and red, respectively, exhibit distributions skewed towards the higher end of the scale.
  • ...and 2 more figures
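
The small computations in the captions of Figures 1 and 4 can be written out explicitly. This is a reading of the captions, not a formula quoted from the paper. In particular, weighting each node's path and label terms by $\alpha$ and $1-\alpha$ is an assumption, since the text here only says that a weight $\alpha \in [0,1]$ balances the two components.

$$
\begin{aligned}
\text{Figure 1, node } x_1:&\quad \alpha \cdot \underbrace{dist_S(\mu_1(x_1), \mu_2(m(x_1)))}_{=\,1} + (1-\alpha) \cdot \underbrace{[\,l(x_1) \neq l(m(x_1))\,]}_{=\,0} = \alpha \\
\text{Figure 1, node } y_0:&\quad \alpha \cdot 0 + (1-\alpha) \cdot 1 = 1 - \alpha \\
\text{Figure 4, } n = 5:&\quad \underbrace{1}_{\text{relabel}} + \underbrace{2}_{\text{contract}} + \underbrace{1}_{\text{change label}} + \underbrace{2}_{\text{expand}} + \underbrace{1}_{\text{relabel}} = 7 = 3 \cdot 5 - 8
\end{aligned}
$$

The last line counts the reversed operations as two edge expansions and one relabeling, which is one way to arrive at the caption's total of $7$.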

Theorems & Definitions (17)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Lemma 2
  • proof
  • Corollary 1
  • Theorem 2
  • proof
  • Proposition 1
  • ...and 7 more