LinearAlifold: Linear-Time Consensus Structure Prediction for RNA Alignments
Apoorv Malik, Liang Zhang, Milan Gautam, Ning Dai, Sizhen Li, He Zhang, David H. Mathews, Liang Huang
TL;DR
Predicting an RNA consensus structure from a multiple sequence alignment (MSA) is computationally expensive: RNAalifold scales as $O(k n^3)$ for $k$ aligned sequences of length $n$, which makes long genomes ($n \approx 3\times 10^4$) impractical. LinearAlifold introduces a linear-time consensus folding approach that uses beam search to prune states, reducing the inside-phase complexity to $O(k n b^2)$, and to $O(k n b \log b)$ for MFE, with a default beam size of $b=100$. A LazyOutside outside phase then computes base-pairing probabilities efficiently. The tool offers four output modes (MFE, MEA, ThreshKnot, and alifold-aware stochastic sampling) and supports two energy models, Vienna and BL*, with BL* as the default. It achieves higher accuracy than baselines on benchmark datasets and agrees strongly with experimentally determined SARS-CoV-2 structures. The approach scales to hundreds of genomes and long RNAs, delivering orders-of-magnitude speedups and enabling rapid, large-scale consensus structure analyses of viral genomes. A public web server and open-source code accompany the work.
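The beam-pruning idea mentioned above can be illustrated with a minimal sketch. This is not LinearAlifold's implementation: the function name `beam_prune`, the dictionary-of-scores state representation, and the toy scores are all assumptions made for illustration. The sketch only shows the general principle that, at each position, at most $b$ of the best-scoring candidate states are kept, which bounds the work per position by a function of $b$ rather than of $n$.

```python
import heapq


def beam_prune(states, beam_size=100):
    """Keep at most `beam_size` best-scoring states for one position.

    `states` maps a hypothetical state key (for example, a left boundary i)
    to a score, where higher is better. Both the representation and the
    scoring convention are illustrative assumptions.
    """
    if len(states) <= beam_size:
        # Nothing to prune: return a copy so the caller's dict is untouched.
        return dict(states)
    # Select the top `beam_size` (key, score) pairs by score.
    best = heapq.nlargest(beam_size, states.items(), key=lambda kv: kv[1])
    return dict(best)


# Toy example: 200 candidate states whose score peaks at key 50.
candidates = {i: -abs(i - 50) for i in range(200)}
kept = beam_prune(candidates, beam_size=5)
print(sorted(kept))
```

With a fixed beam size, the number of states carried forward from each position no longer grows with the sequence length, which is what allows the overall cost to become linear in $n$.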
Abstract
Predicting the consensus structure of a set of aligned RNA homologs is a convenient way to find conserved structures in an RNA genome, with many applications including viral diagnostics and therapeutics. However, the most commonly used tool for this task, RNAalifold, is prohibitively slow for long sequences because its runtime scales cubically with sequence length: it takes over a day on 400 SARS-CoV-2 and SARS-related genomes (~30,000 nt). We present LinearAlifold, a much faster alternative that scales linearly with both the sequence length and the number of sequences, building on our earlier work LinearFold, which folds a single RNA in linear time. LinearAlifold is orders of magnitude faster than RNAalifold (0.7 hours on the above 400 genomes, a ~36$\times$ speedup) and achieves higher accuracy when evaluated against a database of known structures. More interestingly, LinearAlifold's prediction on SARS-CoV-2 correlates well with experimentally determined structures, substantially outperforming RNAalifold. Finally, LinearAlifold supports two energy models (Vienna and BL*) and four modes: minimum free energy (MFE), maximum expected accuracy (MEA), ThreshKnot, and stochastic sampling, each of which takes under an hour for hundreds of SARS-CoV variants. The code is available at https://github.com/LinearFold/LinearAlifold and the web server at http://linearfold.org/linear-alifold.
