Machine Learning for RNA Secondary Structure Prediction: a review of current methods and challenges
Giuseppe Sacco, Giovanni Bussi, Guido Sanguinetti
TL;DR
RNA secondary structure prediction sits at the core of understanding RNA function and guiding therapeutics. The paper surveys the shift from thermodynamics- and grammar-based methods to data-driven deep learning, including ab initio, MSA-based, and biophysical hybrids, and highlights the emergence of RNA foundation models trained on massive unlabeled sequence data. A central finding is the generalization crisis: models perform well within training families but poorly on unseen families, driving the adoption of homology-aware benchmarking and hybrid approaches that ground learning in biophysics or evolution. The review also maps outstanding challenges (pseudoknots, kilobase-scale RNAs, chemical modifications, and environmental context) and argues for standardized prospective benchmarks to accelerate progress toward accurately modeling RNA dynamic ensembles in biologically relevant conditions.
Abstract
Predicting the secondary structure of RNA is a core challenge in computational biology, essential for understanding molecular function and designing novel therapeutics. The field has evolved from foundational but accuracy-limited thermodynamic approaches to a new data-driven paradigm dominated by machine learning and deep learning. These models learn folding patterns directly from data, leading to significant performance gains. This review surveys the modern landscape of these methods, covering single-sequence, evolutionary-based, and hybrid models that blend machine learning with biophysics. A central theme is the field's "generalization crisis," where powerful models were found to fail on new RNA families, prompting a community-wide shift to stricter, homology-aware benchmarking. In response to the underlying challenge of data scarcity, RNA foundation models have emerged, learning from massive, unlabeled sequence corpora to improve generalization. Finally, we look ahead to the next set of major hurdles, including the accurate prediction of complex motifs like pseudoknots, scaling to kilobase-length transcripts, incorporating the chemical diversity of modified nucleotides, and shifting the prediction target from static structures to the dynamic ensembles that better capture biological function. We also highlight the need for a standardized, prospective benchmarking system to ensure unbiased validation and accelerate progress.
