Machine Learning for RNA Secondary Structure Prediction: a review of current methods and challenges

Giuseppe Sacco, Giovanni Bussi, Guido Sanguinetti

TL;DR

RNA secondary structure prediction sits at the core of understanding RNA function and guiding therapeutics. The paper surveys the shift from thermodynamics- and grammar-based methods to data-driven deep learning, covering ab initio, MSA-based, and biophysical hybrid models, and highlights the emergence of RNA foundation models trained on massive unlabeled sequence data. A central finding is the generalization crisis: models perform well within training families but poorly on unseen families, driving the adoption of homology-aware benchmarking and hybrid approaches that ground learning in biophysics or evolution. The review also maps outstanding challenges (pseudoknots, kilobase-scale RNAs, chemical modifications, and environmental context) and argues for standardized prospective benchmarks to accelerate progress toward accurately modeling RNA dynamic ensembles in biologically relevant conditions.

Abstract

Predicting the secondary structure of RNA is a core challenge in computational biology, essential for understanding molecular function and designing novel therapeutics. The field has evolved from foundational but accuracy-limited thermodynamic approaches to a new data-driven paradigm dominated by machine learning and deep learning. These models learn folding patterns directly from data, leading to significant performance gains. This review surveys the modern landscape of these methods, covering single-sequence, evolutionary-based, and hybrid models that blend machine learning with biophysics. A central theme is the field's "generalization crisis," where powerful models were found to fail on new RNA families, prompting a community-wide shift to stricter, homology-aware benchmarking. In response to the underlying challenge of data scarcity, RNA foundation models have emerged, learning from massive, unlabeled sequence corpora to improve generalization. Finally, we look ahead to the next set of major hurdles, including the accurate prediction of complex motifs like pseudoknots, scaling to kilobase-length transcripts, incorporating the chemical diversity of modified nucleotides, and shifting the prediction target from static structures to the dynamic ensembles that better capture biological function. We also highlight the need for a standardized, prospective benchmarking system to ensure unbiased validation and accelerate progress.

Paper Structure

This paper contains 31 sections, 3 figures.

Figures (3)

  • Figure 1: Schematic representation of thermodynamics-based RNA secondary structure prediction. The free energy of a structure is computed with the Nearest Neighbor model (left panel) as the sum of contributions from individual structural elements, enabling efficient dynamic programming algorithms to enumerate all possible secondary structures for a given RNA sequence and predict their relative populations (right panel). Secondary structure visualization generated with Forna (Kerpedjiev et al., 2015).
  • Figure 2: Schematic representation of deep learning methods for RNA secondary structure prediction (not including foundation models). Dotted arrows indicate steps that are only included in training, and square brackets indicate optional inputs. Ab initio methods predict structure from a single RNA sequence only; evolutionary methods leverage multiple sequence alignments (MSA) to capture co-evolutionary signals; hybrid methods integrate deep learning with thermodynamic models or experimental data.
  • Figure 3: Schematic representation of backbone training (above) and task-specific fine-tuning/prediction (below) for RNA foundation models. Dotted arrows indicate steps that are only included in training. During backbone training, the model learns general "RNA language" features by predicting masked nucleotides from their surrounding context on massive unlabeled sequence datasets. The pre-trained backbone can then be fine-tuned on smaller, labeled datasets for specific downstream tasks like secondary structure prediction.
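
The dynamic programming idea mentioned in the Figure 1 caption can be illustrated with a minimal sketch. This is not the Nearest Neighbor model from the paper: it is a Nussinov-style recursion that simply maximizes the number of nested base pairs, used here as a deliberately simplified stand-in for an additive energy score. The function name `max_base_pairs`, the `min_loop` parameter, and the set of allowed pairs (Watson-Crick plus G-U wobble) are assumptions made for illustration.

```python
# Simplified stand-in for thermodynamic folding: Nussinov-style dynamic
# programming that maximizes the count of nested base pairs.
# Assumption: every allowed pair scores +1; real methods sum
# Nearest Neighbor free-energy terms instead.

ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),  # wobble pair
}


def max_base_pairs(seq, min_loop=3):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired nucleotides that must
    separate the two partners of a pair (hairpin loop size).
    """
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j], inclusive
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: position j pairs with some k, splitting the interval
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))
```

Because the score of a structure decomposes into independent contributions from sub-intervals, the table can be filled in O(n^3) time. The same decomposition is what lets thermodynamics-based methods replace the pair count with energy terms and, with a sum in place of the maximum, compute ensemble populations.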