LinearAlifold: Linear-Time Consensus Structure Prediction for RNA Alignments

Apoorv Malik, Liang Zhang, Milan Gautam, Ning Dai, Sizhen Li, He Zhang, David H. Mathews, Liang Huang

TL;DR

Predicting an RNA consensus structure from a multiple sequence alignment (MSA) is computationally expensive: RNAalifold scales as $O(k n^3)$ for $k$ sequences of length $n$ and struggles with long genomes (e.g., $n\approx 3\times 10^4$). LinearAlifold makes consensus folding linear-time by using beam search to prune states, which reduces the inside-phase complexity to $O(k n b^2)$, and to $O(k n b \log b)$ for MFE, with a default beam size of $b=100$. A LazyOutside outside phase then computes base-pairing probabilities efficiently. The tool offers four output modes (MFE, MEA, ThreshKnot, and alifold-aware stochastic sampling) and supports two energy models, Vienna and BL*, with BL* as the default. It achieves higher accuracy than baselines on benchmark datasets and agrees strongly with experimentally determined SARS-CoV-2 structures. The approach scales to hundreds of genomes and long RNAs, delivers orders-of-magnitude speedups, and enables rapid, large-scale consensus structure analyses of viral genomes. A public web server and open-source code accompany the work.
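The beam-pruning idea mentioned above can be illustrated with a minimal sketch. This is not the LinearAlifold implementation: the function name `beam_prune`, the dictionary representation of states, and the convention that a higher score is better (for example, a negated free energy) are all assumptions made for illustration. It only shows why bounding the number of surviving states per step by $b$ keeps the per-step work independent of $n$.

```python
import heapq
from typing import Dict, Hashable


def beam_prune(states: Dict[Hashable, float], beam_size: int = 100) -> Dict[Hashable, float]:
    """Keep only the `beam_size` highest-scoring states (hypothetical helper).

    Assumption: higher score is better. A non-positive `beam_size` means
    "no pruning", a convention chosen for this sketch only.
    """
    if beam_size <= 0 or len(states) <= beam_size:
        return dict(states)
    # nlargest selects the top-b (state, score) pairs without a full sort.
    kept = heapq.nlargest(beam_size, states.items(), key=lambda kv: kv[1])
    return dict(kept)


if __name__ == "__main__":
    # 250 candidate states at one step; only the best 100 survive.
    candidates = {("state", i): float(i) for i in range(250)}
    survivors = beam_prune(candidates, beam_size=100)
    print(len(survivors))
```

With at most $b$ states kept at each of the $n$ positions, the total number of states explored grows linearly in $n$ rather than quadratically, which is the source of the linear scaling described above.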

Abstract

Predicting the consensus structure of a set of aligned RNA homologs is a convenient way to find conserved structures in an RNA genome, with many applications including viral diagnostics and therapeutics. However, the most commonly used tool for this task, RNAalifold, is prohibitively slow for long sequences because it scales cubically with sequence length, taking over a day on 400 SARS-CoV-2 and SARS-related genomes (~30,000 nt). We present LinearAlifold, a much faster alternative that scales linearly with both the sequence length and the number of sequences, based on our earlier work LinearFold, which folds a single RNA in linear time. LinearAlifold is orders of magnitude faster than RNAalifold (0.7 hours on the above 400 genomes, a ~36$\times$ speedup) and achieves higher accuracy when evaluated against a database of known structures. More interestingly, LinearAlifold's prediction on SARS-CoV-2 correlates well with experimentally determined structures, substantially outperforming RNAalifold. Finally, LinearAlifold supports two energy models (Vienna and BL*) and four modes: minimum free energy (MFE), maximum expected accuracy (MEA), ThreshKnot, and stochastic sampling, each of which takes under an hour for hundreds of SARS-CoV variants. The code is available at https://github.com/LinearFold/LinearAlifold and the web server at http://linearfold.org/linear-alifold.
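As a quick consistency check on the runtimes quoted above (a back-of-the-envelope calculation, not a figure taken from the paper), the reported LinearAlifold runtime and speedup together imply an RNAalifold runtime of roughly

$$0.7\ \text{h} \times 36 \approx 25.2\ \text{h} > 24\ \text{h},$$

which agrees with the statement that RNAalifold takes over a day on the same 400 genomes.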

Paper Structure (1 section, 12 equations, 8 figures, 2 tables)

Table of Contents

  1. Partition Function Mode

Figures (8)

  • Figure 2: Accuracy comparisons between RNAalifold and LinearAlifold; each family has 10 samples and each sample has $k=30$ homologs. Statistical significance (two-sided) is marked as '$\uparrow$' if LinearAlifold is significantly better, or '$\downarrow$' if RNAalifold is significantly better ($p<0.05$). See also Fig. S1.
  • Figure 3: Structural distance and ensemble defect against run time for different energy models and different methods. The curves show the mean values over 10 samples for each $k$. A--B: MFE prediction. C--D: partition-based structure prediction. E--F: ensemble quality. See Fig. \ref{fig:covid-si} for another version that shows more statistics for each set of 10 samples and uses $k$ as the x-axis.
  • Figure 4: Visualizations of structure predictions on $k=30$ SARS-CoV genomes (A--G) compared with the experimentally guided hybrid structure (H). A--C: Circular plots of base-pairing probabilities (BPPs) from LinearAlifold (two energy models) and LinearTurboFold on $k=30$ genomes (sample 5/10). Blue arcs are consistent with at least one range from Ziv et al. (2020), while red arcs are not supported by any such range. The darkness of the arcs indicates pairing probability. D--F: stochastic sampling statistics (over 10,000 structures) between the competing global (arch 3 from Ziv et al.) and local (SL3 from Huston et al., 2021) structures. G: the 5' and 3' UTR structures of the LinearAlifold (BL*) ThreshKnot prediction, with shades of blue for the unpaired probability of each nucleotide and shades of black for the pairing probability of each pair. H: the reference hybrid structure based on Huston et al.'s SHAPE-guided model but with the end-to-end arch 3 from Ziv et al. replacing SL3.
  • Figure S1: Accuracy comparisons on the RNAstralign dataset, similar to Fig. 2 but including more systems. Each family has 10 samples, and each sample is an MSA with $k=30$ homologs. Align-then-fold systems (RNAalifold, LinearAlifold, and LinAliFold) tend to be inaccurate for low sequence identity families (e.g., SRP and group 1) and tend to be more accurate for high sequence identity families (e.g., 16S rRNA). Refer to Fig. S2 for a similar figure with 20 samples per family.
  • Figure S2: Accuracy comparisons on the RNAstralign dataset. Each family has 20 samples, and each sample is an MSA with $k=30$ homologs. Align-then-fold systems (RNAalifold, LinearAlifold, and LinAliFold) tend to be inaccurate for low sequence identity families (e.g., SRP and group 1) and tend to be more accurate for high sequence identity families (e.g., 16S rRNA). Refer to Fig. S1 for a similar figure with 10 samples per family.
  • ...and 3 more figures