LoRA-BERT: a Natural Language Processing Model for Robust and Accurate Prediction of long non-coding RNAs

Nicholas Jeon, Xiaoning Qian, Lamin SaidyKhan, Paul de Figueiredo, Byung-Jun Yoon

TL;DR

LoRA-BERT achieves state-of-the-art performance in predicting both lncRNAs and mRNAs for human and mouse species and yields valuable insights into the traits of lncRNAs and mRNAs, which may aid in understanding and detecting diseases linked to lncRNAs in humans.

Abstract

Long non-coding RNAs (lncRNAs) serve as crucial regulators in numerous biological processes. Although they share sequence similarities with messenger RNAs (mRNAs), lncRNAs perform entirely different roles, providing new avenues for biological research. The emergence of next-generation sequencing technologies has greatly advanced the detection and identification of lncRNA transcripts, and deep learning-based approaches have been introduced to classify lncRNAs. These methods have significantly improved the efficiency of lncRNA identification. However, many of them lack robustness and accuracy because of the extended length of the sequences involved. To tackle this issue, we introduce a novel pre-trained bidirectional encoder representation called LoRA-BERT. LoRA-BERT is designed to capture the importance of nucleotide-level information during sequence classification, leading to more robust and satisfactory outcomes. In a comprehensive comparison with commonly used sequence prediction tools, we demonstrate that LoRA-BERT outperforms them in terms of accuracy and efficiency. Our results indicate that, using the transformer model, LoRA-BERT achieves state-of-the-art performance in predicting both lncRNAs and mRNAs for human and mouse species. With LoRA-BERT, we also gain valuable insights into the traits of lncRNAs and mRNAs, which may aid in understanding and detecting diseases linked to lncRNAs in humans.

Paper Structure

This paper contains 14 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Schematic diagram of the model. First, feature extraction partitions each input RNA sequence into $k$-mers. Given the complete set of $k$-mer sequences, the tokenization layer obtains an embedding vector for each $k$-mer, converting the $k$-mer sequences into a matrix in a continuous vector space. The model consists of 12 BERT layers with multi-head self-attention, two layer-normalization steps, and a feed-forward network. The final classification layer computes the class probability for the input sequence and determines whether it falls into the positive or negative class.
  • Figure 2: LoRA-BERT model architecture. (a) LoRA-BERT takes a tokenized, feature-extracted sequence as input, which includes the classification and separator tokens. The tokenized sequence goes through the embedding layers and then passes through 12 transformer layers. We use the first output of the last hidden states for sequence-level classification. (b) The multi-head attention architecture comprises multiple attention score layers operating in parallel.
  • Figure 3: ROC curves of the different models for (a) human and (b) mouse, with TPR on the vertical axis and FPR on the horizontal axis.
  • Figure 4: ROC curves of the different models at different partial-sequence ratios for the human species: (a) 90%, (b) 80%, (c) 70%, and (d) 60% partial sequences.
  • Figure 5: ROC curves of the different models at different partial-sequence ratios for the mouse species: (a) 90%, (b) 80%, (c) 70%, and (d) 60% partial sequences.
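
The $k$-mer partitioning and tokenization step described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the value $k = 3$, the stride of 1, the special-token names `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, the U-to-T normalization, and the helper names `kmerize`, `build_vocab`, and `encode` are all assumptions made for illustration.

```python
from itertools import product

def kmerize(seq, k=3, stride=1):
    """Split a sequence into overlapping k-mers (k and stride are assumed values)."""
    seq = seq.upper().replace("U", "T")  # assumption: treat RNA 'U' as DNA-style 'T'
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def build_vocab(k=3, alphabet="ACGT"):
    """Map special tokens and every possible k-mer to an integer id."""
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]  # assumed BERT-style names
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {tok: i for i, tok in enumerate(specials + kmers)}

def encode(seq, vocab, k=3):
    """Wrap the k-mers in classification/separator tokens and convert to ids."""
    tokens = ["[CLS]"] + kmerize(seq, k) + ["[SEP]"]
    # k-mers containing ambiguous bases (e.g. 'N') fall back to [UNK]
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

vocab = build_vocab(k=3)
ids = encode("AUGGCN", vocab)  # one id per token, ready for an embedding layer
```

In a setup like this, the id sequence would be fed to the embedding layers, and the output at the first position (the classification token) would be used for the sequence-level prediction, as the Figure 2 caption describes.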