Accurate RNA 3D structure prediction using a language model-based deep learning approach

Tao Shen, Zhihang Hu, Siqi Sun, Di Liu, Felix Wong, Jiuming Wang, Jiayang Chen, Yixuan Wang, Liang Hong, Jin Xiao, Liangzhen Zheng, Tejas Krishnamoorthi, Irwin King, Sheng Wang, Peng Yin, James J. Collins, Yu Li

TL;DR

This study introduces RhoFold+, a fully automated, language model–guided framework for de novo RNA 3D structure prediction. By integrating an RNA foundation model trained on ~23.7 million sequences with MSA features and a geometry-aware IPA-based structure module, it produces fast, high-accuracy predictions and generalizes to unseen RNA types and newly determined structures. It outperforms multiple existing methods on RNA-Puzzles and CASP15 natural targets, and it also yields accurate secondary structure predictions and inter-helical angle estimates that are useful for construct design. The work highlights the potential of RNA-focused foundation models to accelerate RNA structure determination and design, with practical implications for drug targeting and synthetic biology.

Abstract

Accurate prediction of RNA three-dimensional (3D) structure remains an unsolved challenge. Determining RNA 3D structures is crucial for understanding their functions and informing RNA-targeting drug development and synthetic biology design. The structural flexibility of RNA, which leads to scarcity of experimentally determined data, complicates computational prediction efforts. Here, we present RhoFold+, an RNA language model-based deep learning method that accurately predicts 3D structures of single-chain RNAs from sequences. By integrating an RNA language model pre-trained on ~23.7 million RNA sequences and leveraging techniques to address data scarcity, RhoFold+ offers a fully automated end-to-end pipeline for RNA 3D structure prediction. Retrospective evaluations on RNA-Puzzles and CASP15 natural RNA targets demonstrate RhoFold+'s superiority over existing methods, including human expert groups. Its efficacy and generalizability are further validated through cross-family and cross-type assessments, as well as time-censored benchmarks. Additionally, RhoFold+ predicts RNA secondary structures and inter-helical angles, providing empirically verifiable features that broaden its applicability to RNA structure and function studies.

Paper Structure (12 sections, 3 equations, 5 figures)

Figures (5)

  • Figure 1: The architecture of RhoFold+ and the tasks used for performance evaluation. a. The architecture of RhoFold+, a fully automated and differentiable end-to-end approach to de novo RNA 3D structure prediction from sequence. Using an RNA language model (RNA-FM) pre-trained on 23,735,169 unannotated RNA sequences and several deep learning modules, including an invariant point attention (IPA) module that models 3D positions, RhoFold+ can generate valid and largely accurate RNA 3D structures of interest, typically within ~0.14 seconds (without MSA searching). b. The preprocessing step of RhoFold+ to extract all available non-redundant single-stranded RNA 3D structures from the PDB database. RhoFold+ is comprehensively benchmarked on community-wide challenges including RNA-Puzzles targets and CASP15 natural RNA targets, and on all available experimentally determined RNA 3D structures. RhoFold+ also demonstrates high accuracy in cross-validation experiments, as well as generalizability to unseen, newly determined RNA structures and unseen RNA families and types in cross-family and cross-type validation experiments. Data-split evaluations reveal that RhoFold+ does not overfit its training set. RhoFold+ is also capable of predicting secondary structures and parameters that are useful for construct engineering.
  • Figure 2: Benchmarking RhoFold+ on previously held community-wide challenges. The central curve in the c, g, and j panels represents the fitted regression model, while the two surrounding curves indicate the 95% percentile intervals. a. RMSD performance scatterplot of RhoFold+ and other methods across 24 non-overlapping, non-redundant RNA-Puzzles targets. Each point represents a predicted model from a specific method. b. Visualization of RNA-Puzzles 7 and 38. In addition to the aligned RhoFold+ prediction, we show the most similar training structure with respect to each target, suggesting that RhoFold+ neither overfits the training set nor simply reproduces the most similar structure to the target. c. Regression plot of the TM-score and lDDT of RhoFold+'s predictions against the maximum sequence similarity among all the training sequences, across all RNA-Puzzles targets. Each point represents an RNA-Puzzles target. d. Running time comparison for different methods. e. Comparison of RhoFold+'s predictions against the respective best single templates from our training set across all RNA-Puzzles targets. f. Regression plot for C1' RMSD against atom-level pLDDT across all RNA-Puzzles and CASP15 targets. g. Regression plot for structure GDT-TS against MSA similarity across all RNA-Puzzles and CASP15 targets. h. Detailed performance comparison for CASP15 natural RNA targets. The pink columns record detailed RMSD values and the blue columns record the sum of Z-scores for GDT-TS and TM-score. i. Comparison of RhoFold+'s average performance against the average reported performance of CASP15 groups and published works on CASP15 natural RNA targets. j. Regression plot for structure GDT-TS and lDDT against sequence length across all CASP15 targets. k. Comparison of RhoFold+'s predictions against AIchemy_RNA2 and UltraFold on the R1116 target from CASP15. l. Prediction for the R1156 target, showing one of RhoFold+'s potential failure cases, which involves incorrect stacking patterns and orientations.
  • Figure 3: Benchmarking RhoFold+ on all experimentally determined RNA structures supports RhoFold+'s accuracy and ability to generalize to unseen structures. The central curve in the b and h panels represents the fitted regression model, while the two surrounding curves indicate the 95% percentile intervals. a-d. Ten-fold cross-validation of RhoFold+ using all experimentally determined RNA structures. a. Plot of RMSD values against sequence length for all cross-validation experiments. Each point represents an RNA structure and is colored according to the cross-validation fold. b. Regression analysis for each prediction's TM-score (blue) and lDDT (pink) against the maximum sequence similarity with respect to all training data. Each point represents an RNA structure. c. Average TM-score and lDDT for each fold. d. Visualization of two representative riboswitch structures, 6UES and 3UD4, and a pseudoknot 1DDY (pink), along with the corresponding RhoFold+ predictions (slate) and the training RNA structures with the highest sequence similarity (cyan). e. Visualization of a newly determined RNA structure, 7QR3, which has a low structural similarity with respect to the training set, but whose structure (pink) is accurately predicted by RhoFold+ (slate). The most similar structure, 7DLZ, is shown in cyan. f. Comparison of average RMSD values generated by RhoFold+ and other methods on the new PDB set, a set of 76 newly determined solo RNA structures. g. Regression plot of the prediction RMSD values against maximum sequence similarity to the training set for RhoFold+ and other baseline methods. h. Regression plot of the correlation between RhoFold+ predictions' TM-score/lDDT and the maximum MSA profile similarity against the training set. i. Overview of cross-type validation performance of RhoFold+ measured by lDDT and TM-score. All structures in the type used for validation were masked during model training. j. Violin plot of RhoFold+'s RMSD values in the cross-family validation. Here, all the structures in a family to be tested were masked during model training, and RhoFold+ accurately predicted RNA structures from most unseen families. The numbers of sequences in each family are shown in parentheses.
  • Figure 4: RhoFold+ accurately predicts secondary structures and inter-helical angles from experimental data. The central curve in the e, j, and k panels represents the fitted regression model, while the two surrounding curves indicate the 95% percentile intervals. a. F1-score comparison against multiple configurations of UFold on the PDB set. Here, a version of UFold trained on bpRNA is also presented as a baseline, in order to evaluate the improvement in terms of F1-score. b. F1-score distribution of various methods on the ArchiveII dataset. Average scores are indicated at the top of the plot. c. F1-score comparison between RhoFold+ and UFold on the ArchiveII dataset. Each point represents an RNA structure and is colored according to its RNA type. d. F1-score comparison of RhoFold+ vs. UFold and SPOT-RNA on RNA substructures in the new PDB set. e. F1-score comparison of RhoFold+ vs. UFold and SPOT-RNA against sequence similarity of RNA structures in the new PDB set. f. Visualization of a CASP15 RNA target where RhoFold+ predicted the correct secondary structures including pseudoknots. g. Visualization of a swapped dimer, 3SUH, for which RhoFold+'s prediction (purple) resembles the biologically meaningful structure (orange) instead of the crystallographic artifact found in the PDB (pink). h. Visualization showing the definition of the inter-helical angle difference (IHAD), which is the difference between the inter-helical angles (IHAs) derived from RhoFold+'s prediction and the experimentally determined structure. i. Regression analysis between the IHAD and RMSD of RhoFold+'s predictions. Each point represents an RNA. j. Comparison between the IHAs derived from RhoFold+'s predictions against those from experimental structures. Each point represents an angle instance and is colored according to the RMSD between the experimental structure containing the angle and the structure predicted by RhoFold+. k. Plot of the IHAD against experimentally determined IHA values. The coloring is the same as in j.
  • Figure 5: Ablation studies of RhoFold+ and sampling of multiple models. The central curve in the e and f panels represents the fitted regression model, while the two surrounding curves indicate the 95% percentile intervals. a-c. Ablation studies of RhoFold+ using the Ablation set (excluding complexes). a. Ablation studies of RhoFold+ after removing corresponding modules in RhoFold+ with performance measured by RMSD. b. Regression analysis for prediction accuracy (measured by RMSD) against the reciprocal of sequence similarity. c. A regression analysis of the TM-score against MSA depth for the ablation study of the RNA-FM module. Note that the x-axis is log-scaled. d-f. Detailed studies on the impact of MSAs. d. Plot of prediction accuracy (measured by the TM-score) against MSA depth. e. Plot of the improvement of RhoFold+ against RhoFold (measured by RMSD) across different MSA depths. f. Plot of the improvement of RhoFold+ against RhoFold (measured by RMSD) across different MSA profile similarities. g-h. Examples of RhoFold+'s multiple models. g. Visualization of a CASP15 target where RhoFold+ produces an RMSD of 12.51 Å, but improves by 8.92 Å using Top5 prediction from MSA sampling. h. Visualization of a newly determined RNA structure where RhoFold+'s RMSD improves by 7.92 Å using Top5 prediction from MSA sampling.
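
Two quantities named in the Figure 4 caption, the base-pair F1-score and the inter-helical angle difference (IHAD), can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes that a secondary structure is given as a set of (i, j) base-pair indices, that each helix is summarized by an axis direction vector supplied by the caller, and that the IHAD is taken as the absolute difference of the two angles in degrees. The function names `base_pair_f1`, `inter_helical_angle`, and `ihad` are hypothetical.

```python
import math


def base_pair_f1(predicted, reference):
    """F1-score between two collections of base pairs given as (i, j) tuples."""
    # Normalise the order so that (i, j) and (j, i) count as the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def inter_helical_angle(axis_a, axis_b):
    """Angle in degrees between two helix axis direction vectors (assumed inputs)."""
    dot = sum(a * b for a, b in zip(axis_a, axis_b))
    norm_a = math.sqrt(sum(a * a for a in axis_a))
    norm_b = math.sqrt(sum(b * b for b in axis_b))
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def ihad(predicted_axes, experimental_axes):
    """Assumed definition: |IHA(prediction) - IHA(experiment)| in degrees."""
    return abs(inter_helical_angle(*predicted_axes)
               - inter_helical_angle(*experimental_axes))


# Usage with made-up inputs: two of three predicted pairs match the reference,
# and the predicted helices are perpendicular while the experimental ones are not.
f1 = base_pair_f1([(1, 10), (2, 9), (3, 8)], [(1, 10), (2, 9), (4, 7)])
angle_gap = ihad(((1, 0, 0), (0, 1, 0)), ((1, 0, 0), (1, 1, 0)))
print(f"F1 = {f1:.3f}, IHAD = {angle_gap:.1f} degrees")
```

How helix axes are fitted from atomic coordinates is not specified in the captions above, so the sketch leaves that step to the caller.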