Median and Small Parsimony Problems on RNA trees
Bertrand Marchand, Yoann Anselmetti, Manuel Lafond, Aïda Ouangraoua
TL;DR
This work addresses the reconstruction of ancestral RNA secondary structures by formulating median and small parsimony problems on RNA trees under three distance measures: Robinson-Foulds (RF), Internal-Leafset (IL), and Relaxed Edit (RE). It establishes polynomial-time solutions for several variants (notably the RF_NC, RF_ILC, and IL_ILC median problems and the RF_NC small parsimony problem) and develops a novel dynamic-programming framework for IL medians based on structural partitions. The study shows that the choice of constraints strongly affects the resolution of the inferred ancestral structures: IL_ILC and RF_ILC yield the most detailed reconstructions, while RF_NC tends to produce sparser ancestral structures. Experiments on RFAM-derived and randomized datasets illustrate the practical utility of the methods and highlight open challenges for RE medians and for certain IL and RE small parsimony variants. Overall, the framework supports a structured, multi-metric analysis of RNA structure evolution, aiding the exploration of distinctions within RNA clans and of the Ancient RNA World hypothesis.
Abstract
Motivation: Non-coding RNAs (ncRNAs) express their functions by adopting molecular structures. Specifically, RNA secondary structures serve as a relatively stable intermediate step before tertiary structures, offering a reliable signature of molecular function. Consequently, within an RNA functional family, secondary structures are generally more evolutionarily conserved than sequences. Conversely, homologous RNA families grouped within an RNA clan share ancestors but typically exhibit structural differences. Inferring the evolution of RNA structures within RNA families and clans is crucial for gaining insights into functional adaptations over time and for providing clues about the Ancient RNA World Hypothesis.
Results: We introduce the median problem and the small parsimony problem for ncRNA families, where secondary structures are represented as leaf-labelled trees. We use the Robinson-Foulds (RF) tree distance, which corresponds to a specific edit distance between RNA trees, and a new metric called the Internal-Leafset (IL) distance. While the RF tree distance compares the sets of leaves descending from the internal nodes of two RNA trees, the IL distance compares the collections of leaf-children of their internal nodes. The IL distance is better at capturing differences in the structural elements of RNAs than the RF distance, which is more focused on base pairs. We also consider a more general tree edit distance that allows the mapping of base pairs that are not perfectly aligned. We study the theoretical complexity of the median problem and the small parsimony problem under the three distance metrics and various biologically relevant constraints, and we present polynomial-time maximum parsimony algorithms for solving some versions of the problems. Our algorithms are applied to ncRNA families from the RFAM database, illustrating their practical utility.
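The contrast between the RF and IL distances described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made purely for illustration: trees are encoded as nested tuples with integer leaves, both distances are taken to be the size of a symmetric difference between sets, and internal nodes without leaf-children are ignored in the IL collection. The function names (descendant_leafsets, internal_leafsets, rf_distance, il_distance) are hypothetical.

```python
from typing import FrozenSet, Set, Tuple, Union

# Assumed encoding: a leaf is an int, an internal node is a tuple of children.
Tree = Union[int, Tuple["Tree", ...]]


def leaf_set(tree: Tree) -> FrozenSet[int]:
    """Return all leaves that descend from (or are) the given node."""
    if isinstance(tree, int):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def descendant_leafsets(tree: Tree) -> Set[FrozenSet[int]]:
    """For each internal node, collect the set of leaves descending from it."""
    if isinstance(tree, int):
        return set()
    result = {leaf_set(tree)}
    for child in tree:
        result |= descendant_leafsets(child)
    return result


def internal_leafsets(tree: Tree) -> Set[FrozenSet[int]]:
    """For each internal node, collect the set of its children that are leaves.

    Assumption: internal nodes with no leaf-children are skipped.
    """
    if isinstance(tree, int):
        return set()
    result: Set[FrozenSet[int]] = set()
    leaf_children = frozenset(c for c in tree if isinstance(c, int))
    if leaf_children:
        result.add(leaf_children)
    for child in tree:
        result |= internal_leafsets(child)
    return result


def rf_distance(t1: Tree, t2: Tree) -> int:
    """RF-style distance: symmetric difference of descendant leaf sets."""
    return len(descendant_leafsets(t1) ^ descendant_leafsets(t2))


def il_distance(t1: Tree, t2: Tree) -> int:
    """IL-style distance: symmetric difference of leaf-children sets."""
    return len(internal_leafsets(t1) ^ internal_leafsets(t2))


# Two toy trees on the same five leaves; leaf 3 hangs at a different node.
tree_a: Tree = ((1, 2), 3, (4, 5))
tree_b: Tree = ((1, 2, 3), (4, 5))

print("RF:", rf_distance(tree_a, tree_b))
print("IL:", il_distance(tree_a, tree_b))
```

On this toy pair, moving leaf 3 changes one descendant leaf set in each tree but also changes the leaf-children of the root, so the two measures count the same rearrangement differently. The paper's exact definitions may differ from this sketch, for example in how nodes without leaf-children are treated.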
