Comprehensive benchmarking of large language models for RNA secondary structure prediction

L. I. Zablocki, L. A. Bugnon, M. Gerard, L. Di Persia, G. Stegmayer, D. H. Milone

TL;DR

This paper presents a comprehensive experimental and comparative analysis of recently proposed pretrained RNA-LLMs. It shows that two LLMs clearly outperform the other models and reveals significant challenges for generalization in low-homology scenarios.

Abstract

Inspired by the success of large language models (LLMs) for DNA and proteins, several LLMs for RNA have been developed recently. RNA-LLMs use large datasets of RNA sequences to learn, in a self-supervised way, how to represent each RNA base with a semantically rich numerical vector. This is done under the hypothesis that obtaining high-quality RNA representations can enhance data-costly downstream tasks. Among these tasks, predicting the secondary structure is fundamental for uncovering RNA functional mechanisms. In this work we present a comprehensive experimental analysis of several pre-trained RNA-LLMs, comparing them for the RNA secondary structure prediction task in a unified deep learning framework. The RNA-LLMs were assessed with increasing generalization difficulty on benchmark datasets. Results showed that two LLMs clearly outperform the other models, and revealed significant challenges for generalization in low-homology scenarios.

Paper Structure

This paper contains 13 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: RNA-LLM embeddings and deep neural architecture of the prediction model. a, Flow diagram for RNA secondary structure prediction. Each LLM was downloaded from its official repository, frozen and used to get per-nucleotide embeddings for each sequence in the benchmark datasets. The embeddings go through a fully connected layer, outer concatenation, two 2D ResNet blocks and a final 2D convolution. This flow is explained in detail in the aligned panels b-e. b, One-hot encoding, of size $4 \times L$, and LLM representation, of size $d \times L$, where $d$ is the embedding dimension and $L$ is the padded sequence length. A fully connected layer reduces the input dimension to $M/2$. c, The $M/2 \times L$ projection is transformed to an $M \times L \times L$ tensor using outer concatenation. The position $(i, j)$ contains the concatenated representation of nucleotides $i$ and $j$. d, Then, there are two 2D ResNet blocks with 2D convolution, instance normalization and ReLU. ResNet block 1 has kernel size 1 and ResNet block 2 has kernel size 3. e, A final convolution yields the output scores as an $L \times L$ connection matrix. f-l, RNA-LLM embeddings projected with UMAP for dimensionality reduction. The RNA families of the ArchiveII dataset are highlighted with different colors. Sequences within the same RNA family are expected to be close in the dimensionally reduced space.
  • Figure 2: Comparative results among RNA-LLM on the RNA secondary structure prediction task for different benchmark datasets of increasing complexity. Each method has a different color. The thermodynamic prediction method (LinearPartition-V, dashed black line) and the DL-based prediction method (sincFold, solid black line) are added as baselines. Next to the RNA-LLM performances, the sequence length distribution for each dataset is shown in blue. a, ArchiveII 5-fold random cross-validation. A Wilcoxon test for paired samples with Bonferroni correction indicates that all differences are statistically significant ($P<0.0001$, $N=3,864$). b, bpRNA train-test partitions with controlled homology. All differences are statistically significant ($P<0.0001$, $N=1,305$) except for one-hot and RNABERT. c, bpRNA-new dataset, for RNA families not seen during training. All differences are statistically significant ($P<0.0001$, except for RNAErnie versus RNA-MSM and one-hot with $P<0.05$, $N=5,401$). d, PDB-RNA dataset, with RNA sequences extracted from PDB. Most differences are not statistically significant among RNA-LLM, except for RiNALMo and ERNIE-RNA with respect to RNA-FM, RNABERT and RNAErnie. Details of the statistical analysis are in Supplementary Fig. S1a-d. e, Average $F_1$ per motif type on bpRNA-new. f, Performance considering only non-canonical interactions on PDB-RNA.
  • Figure 3: Inter-family structure prediction based on RNA-LLM. Performance evaluated on the 9 RNA families of the ArchiveII dataset. Each boxplot represents the $F_1$ performance of all methods for a given family in the test set. The thermodynamic prediction method (LinearPartition-V, dashed black line) and the DL-based method (sincFold, solid black line) are added as baselines in all plots. Below the RNA-LLM performances, the sequence length distribution for each dataset is shown in blue and the distribution of minimum test-train structural distance (Reuter2010) is shown in orange. a, tRNA family. A Wilcoxon test for paired samples with Bonferroni correction indicates that all differences are statistically significant ($P<0.0001$, $N=557$), except in the case of RiNALMo with ERNIE-RNA, and RNABERT with one-hot. b, 5s family. All differences are statistically significant ($P<0.0001$, $N=1283$). c, tmRNA family. All differences are statistically significant ($P<0.0001$, $N=462$). d, RNaseP family. Only the differences between ERNIE-RNA and RiNALMo, RNA-MSM and RNAErnie, and one-hot and RNABERT are not statistically significant. e, srp family. Differences for the top 4 methods are statistically significant ($P<0.0001$, $N=918$). f, grp1 family. Differences for the top 2 methods are statistically significant ($P<0.0001$, $N=74$). g, 23s family. The top 2 methods are significantly better than all the other methods ($P<0.01$, $N=15$). h, 16s family. RiNALMo and ERNIE-RNA are statistically different from each other and from the rest of the methods ($P<0.0001$, $N=66$). i, telomerase family. ERNIE-RNA and one-hot are statistically different from RNABERT, RiNALMo, RNA-MSM and RNAErnie. Details of the statistical analysis are in Supplementary Fig. S2.
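
The outer concatenation step described in the Figure 1 caption (panel c) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `outer_concat` and the toy sizes are assumptions chosen for illustration. It only shows how an $M/2 \times L$ projection can become an $M \times L \times L$ tensor whose position $(i, j)$ holds the concatenated representations of nucleotides $i$ and $j$.

```python
import numpy as np


def outer_concat(x: np.ndarray) -> np.ndarray:
    """Map a (C, L) per-nucleotide projection to a (2C, L, L) pairwise tensor.

    The slice [:, i, j] is the concatenation of x[:, i] and x[:, j].
    `outer_concat` is a hypothetical helper name, not taken from the paper.
    """
    c, length = x.shape
    # rows[k, i, j] = x[k, i]: representation of nucleotide i, repeated along j
    rows = np.broadcast_to(x[:, :, None], (c, length, length))
    # cols[k, i, j] = x[k, j]: representation of nucleotide j, repeated along i
    cols = np.broadcast_to(x[:, None, :], (c, length, length))
    # Stack along the channel axis so each (i, j) holds both representations
    return np.concatenate([rows, cols], axis=0)


# Toy sizes, assumed for illustration only: M/2 = 3 channels, L = 5 nucleotides
rng = np.random.default_rng(0)
proj = rng.normal(size=(3, 5))
pair = outer_concat(proj)
print(pair.shape)  # (6, 5, 5)
```

With $M/2 = 3$ and $L = 5$, the result has shape $6 \times 5 \times 5$. In the architecture described in the caption, a tensor of this kind is what the 2D ResNet blocks would take as input.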