Median and Small Parsimony Problems on RNA trees

Bertrand Marchand, Yoann Anselmetti, Manuel Lafond, Aïda Ouangraoua

TL;DR

This work tackles the reconstruction of ancestral RNA secondary structures by formulating median and small parsimony problems on RNA trees under three distance measures: Robinson-Foulds (RF), Internal-Leafset (IL), and Relaxed Edit (RE). It establishes polynomial-time solutions for several variants (notably the RF_NC, RF_ILC, and IL_ILC medians, and RF_NC small parsimony) and develops a novel dynamic-programming framework for IL medians based on structural partitions. The study shows that the choice of constraint strongly affects the resolution of the inferred ancestral structures: IL_ILC and RF_ILC yield the most detailed reconstructions, while RF_NC tends to produce sparser ancestral structures. Experimental results on RFAM-derived and randomized datasets illustrate the practical utility of the methods and highlight open challenges for RE medians and certain IL/RE small parsimony variants. Overall, the framework enables structured, multi-metric analysis of RNA structure evolution, aiding exploration of RNA clan distinctions and the Ancient RNA World hypothesis.

Abstract

Motivation: Non-coding RNAs (ncRNAs) express their functions by adopting molecular structures. Specifically, RNA secondary structures serve as a relatively stable intermediate step before tertiary structures, offering a reliable signature of molecular function. Consequently, within an RNA functional family, secondary structures are generally more evolutionarily conserved than sequences. Conversely, homologous RNA families grouped within an RNA clan share ancestors but typically exhibit structural differences. Inferring the evolution of RNA structures within RNA families and clans is crucial for gaining insights into functional adaptations over time and providing clues about the Ancient RNA World Hypothesis.

Results: We introduce the median problem and the small parsimony problem for ncRNA families, where secondary structures are represented as leaf-labelled trees. We utilize the Robinson-Foulds (RF) tree distance, which corresponds to a specific edit distance between RNA trees, and a new metric called the Internal-Leafset (IL) distance. While the RF tree distance compares sets of leaves descending from internal nodes of two RNA trees, the IL distance compares the collection of leaf-children of internal nodes. The latter is better at capturing differences in structural elements of RNAs than the RF distance, which is more focused on base pairs. We also consider a more general tree edit distance that allows the mapping of base pairs that are not perfectly aligned. We study the theoretical complexity of the median problem and the small parsimony problem under the three distance metrics and various biologically-relevant constraints, and we present polynomial-time maximum parsimony algorithms for solving some versions of the problems. Our algorithms are applied to ncRNA families from the RFAM database, illustrating their practical utility.

Paper Structure (26 sections, 11 theorems, 19 equations, 5 figures, 1 table, 1 algorithm)


Key Result

Proposition 1

Given an RNA secondary structure $R=(S,P)$ with $S$ of length $n$, the tree $T(R)$ is an RNA tree with leafset $[0,n+1]$. Conversely, given an RNA tree $T$ with leafset $L(T)=[0,n+1]$, the pair $R=(S,P)$, where $S$ is a string of length $n$ over the alphabet $\Sigma$ and $P=I(T)\setminus\{(0,n+1)\}$, is an RNA secondary structure.
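The correspondence in Proposition 1 can be illustrated with a minimal sketch. It assumes that $I(T)$ can be read as a set of pairs (first leaf, last leaf), one per internal node, and that positions $0$ and $n+1$ are virtual endpoints paired at the root. The function names are hypothetical and are not taken from the paper.

```python
def parse_dot_bracket(db):
    """Return the base-pair set P of a dot-bracket string, 1-indexed."""
    stack, pairs = [], set()
    for pos, ch in enumerate(db, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.add((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    if stack:
        raise ValueError("unmatched '('")
    return pairs


def internal_nodes(db):
    """Assumed reading of I(T(R)): the base pairs plus the root pair (0, n+1)."""
    n = len(db)
    return parse_dot_bracket(db) | {(0, n + 1)}


def structure_from_internal_nodes(nodes, n):
    """Recover P = I(T) minus {(0, n+1)}, as in the second half of the proposition."""
    return set(nodes) - {(0, n + 1)}
```

For the structure `((..)).` with $n=7$, `internal_nodes` returns the pairs $(1,6)$, $(2,5)$ and the root pair $(0,8)$, and removing the root pair gives back the original base pairs.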

Figures (5)

  • Figure 1: (Top) An RNA structure given in dot-bracket notation, and (Bottom) the corresponding tree representation, as defined formally in Definition 1.
  • Figure 2: (Left) The Small Parsimony Problem consists in finding the best assignment of RNA structures to internal nodes of a given phylogeny, minimizing the sum of distances over edges. (Right) The Median Problem consists in finding a single RNA structure that minimizes the sum of distances to the input structures.
  • Figure 3: (A) Scatter plots of the number of base-pairs predicted at the root of the phylogeny by RF_NC, RF_ILC, IL_ILC, IL_NC and RE for all families in the FILTERED_RFAM dataset. Each point is an RFAM family, with its $x,y$-coordinates equal to the number of predicted base-pairs at the root of the phylogeny by the respective methods. For each plot, the red square point indicates the average, its error bars being the standard deviation. As in Figure 5 (Appendix), RF_NC seems to provide the least resolved results. IL_NC performs slightly better, but RF_ILC and IL_ILC yield the most base pairs. Both of them seem to fare comparably with each other, indicating that the ILC constraint (only internal leafsets from the input structures) is the deciding factor. (B) Average maximum number of base-pairs in reconstructed ancestral structures, as a function of the height of the corresponding node in the phylogenetic tree, over the RANDOM dataset. We observe as a general trend that the number of base-pairs in predicted ancestral structures decreases as we move up the trees. However, where RF_NC very quickly predicts empty structures at ancestral nodes, the other metric/constraint combinations (RF_ILC, IL_ILC and IL_NC) do predict non-empty structures. Remarkably, IL_NC does so without any constraint.
  • Figure 4: Distribution of the Small Parsimony (SP) costs obtained over the FILTERED_RFAM (bottom) and the RANDOM dataset with random structures of length $30$ (top), by each method with respect to all three distances. The Small Parsimony cost is the sum of distances (either RF, IL, or RE) over each edge of a phylogeny. To allow averaging over several RFAM families, costs are divided by the number of edges in the phylogeny.
  • Figure 5: Maximum number of base-pairs as a function of the height of nodes in the phylogeny, for a selected set of 10 maximally-divergent RFAM families. The number of base-pairs is normalized, for each family, by the maximum number of base-pairs over the structures annotating the leaves. The selected "maximally-divergent" families are the ones maximizing the sum of distances over pairs of leaves, as measured by the Internal-Leafset distance (Definition 4). Solving Small Parsimony under the metric/constraint combination RF_NC tends to yield ancestral structures with few base-pairs as we move up the phylogenies. While also unconstrained, IL_NC tends to predict more base-pairs than RF_NC in ancestral structures. Being constrained to use only internal-leafsets from the input structures, IL_ILC and RF_ILC predict the most resolved ancestral structures, by the criterion of the number of base-pairs. The difference in score function (IL vs. RF) seems to have no more than a marginal impact. Note that RF_NC is DLC (only descendant leaf-sets from the input structures), so imposing this constraint would not help get more resolution.
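The two objectives described in Figure 2 can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes structures are given as sets of base pairs and that the distance is the size of the symmetric difference of those sets, which is the usual base-pair distance (the document states in Proposition 2 that the base-pair and RF distances coincide). The helper `majority_median` relies on a standard observation about symmetric-difference medians of sets: keeping the pairs present in a strict majority of the inputs minimizes the sum of distances, and any two such pairs co-occur in at least one input, so the result is again a valid structure. Whether this matches the paper's unconstrained RF median algorithm is an assumption.

```python
from collections import Counter


def bp_distance(p, q):
    """Assumed base-pair distance: size of the symmetric difference."""
    return len(set(p) ^ set(q))


def median_cost(candidate, structures):
    """Median objective: sum of distances from one candidate to all inputs."""
    return sum(bp_distance(candidate, s) for s in structures)


def small_parsimony_cost(edges, labels):
    """Small parsimony objective: sum of distances over the phylogeny's edges.

    edges  -- iterable of (parent, child) node names (hypothetical encoding)
    labels -- dict mapping every node name to its base-pair set
    """
    return sum(bp_distance(labels[u], labels[v]) for u, v in edges)


def majority_median(structures):
    """Keep the base pairs that occur in strictly more than half of the inputs."""
    counts = Counter(bp for s in structures for bp in set(s))
    return {bp for bp, c in counts.items() if 2 * c > len(structures)}
```

With three leaf structures `{(1, 6), (2, 5)}`, `{(1, 6)}` and `{(1, 6), (2, 5)}`, the majority rule keeps both pairs, and labelling the root of a star phylogeny with that set gives a small parsimony cost of 1. As more divergent inputs are added, fewer pairs reach a strict majority, which fits the trend toward emptier ancestral structures reported for RF_NC in the captions above.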

Theorems & Definitions (22)

  • Definition 1: RNA tree
  • Proposition 1
  • Definition 2: Base pair distance
  • Definition 3: Robinson-Foulds distance
  • Definition 4: Internal-Leafset distance
  • Definition 5: Tree Edit distance
  • Proposition 2: Equality of BP distance and RF distance
  • Lemma 1: Equality of BP distance and TE distance under a specific cost function
  • proof
  • Definition 6: Relaxed Edit distance
  • ...and 12 more
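
The contrast between the RF distance (Definition 3) and the Internal-Leafset distance (Definition 4) can be made concrete with a small sketch. It follows the abstract's description: RF compares the sets of leaves descending from internal nodes, and IL compares the collections of leaf-children of internal nodes. Two points are assumptions made here for illustration: both distances are taken as plain symmetric-difference counts with no normalization, and the root is the virtual pair $(0, n+1)$. The function names are hypothetical.

```python
def leaf_sets(db):
    """Return (descendant leaf sets, leaf-children sets) of the RNA tree of db."""
    n = len(db)
    # Each stack entry is [opening position, leaf children seen so far].
    stack = [[0, [0]]]  # virtual root pair (0, n+1)
    descendants, children = set(), set()
    for pos, ch in enumerate(db, start=1):
        if ch == "(":
            stack.append([pos, [pos]])
        elif ch == ".":
            stack[-1][1].append(pos)  # unpaired position: leaf child of the open node
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError(f"unmatched ')' at position {pos}")
            start, kids = stack.pop()
            kids.append(pos)
            descendants.add(frozenset(range(start, pos + 1)))
            children.add(frozenset(kids))
        else:
            raise ValueError(f"unexpected character {ch!r}")
    if len(stack) != 1:
        raise ValueError("unmatched '('")
    root_kids = stack[0][1] + [n + 1]
    descendants.add(frozenset(range(0, n + 2)))
    children.add(frozenset(root_kids))
    return descendants, children


def rf_and_il(db1, db2):
    """Symmetric-difference counts on descendant sets (RF) and leaf-children sets (IL)."""
    d1, c1 = leaf_sets(db1)
    d2, c2 = leaf_sets(db2)
    return len(d1 ^ d2), len(c1 ^ c2)
```

For `((...)).` and `(.(.).).`, the outer pair $(1,7)$ is shared, so the RF-style count is 2 (only the inner pairs differ). The IL-style count is 4, because the loop closed by $(1,7)$ contains different unpaired positions in the two structures, so its leaf-children set differs as well.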