G4-Attention: Deep Learning Model with Attention for predicting DNA G-Quadruplexes

Shrimon Mukherjee, Pulakesh Pramanik, Partha Basuchowdhuri, Santanu Bhattacharya

TL;DR

This work proposes G4-Attention, a deep learning model that adds Bi-LSTM and attention layers on top of a CNN architecture. The authors describe it as the first model to combine these components for predicting G4-forming sequences.

Abstract

G-quadruplexes (G4s) are four-stranded, non-canonical nucleic acid secondary structures formed by stacked guanine tetramers. Their distinctive structural characteristics involve them in a wide range of biological roles. After the completion of the human genome sequencing project, many bioinformatic algorithms were introduced to predict active G4 regions in vitro based on canonical G4 sequence elements, G-richness, and G-skewness, as well as non-canonical sequence features. More recently, sequencing techniques such as G4-seq and G4-ChIP-seq were developed to map G4s in vitro and in vivo, respectively, at a resolution of a few hundred bases. Several machine learning approaches were subsequently developed to predict G4 regions from the existing databases. However, those prediction models were simplistic, and their prediction accuracy was notably poor. In response, we propose G4-Attention, a novel convolutional neural network with Bi-LSTM and attention layers, to predict G4-forming sequences with improved accuracy. G4-Attention achieves high accuracy and attains state-of-the-art results on the G4 prediction task. The model also predicts G4 regions accurately on highly class-imbalanced datasets. In addition, the model trained on the human genome dataset can be applied to any non-human genome DNA sequence to predict G4 formation propensities.
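The abstract mentions G-richness and G-skewness as sequence features used by earlier predictors but does not define them. The sketch below is a minimal illustration under assumed definitions that are common in the G4 literature: G-richness as the fraction of guanines in a sequence, and G-skewness as (G - C) / (G + C). The function names and the example sequence are hypothetical and are not taken from the paper.

```python
def g_richness(seq: str) -> float:
    """Fraction of bases that are guanine (assumed definition)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return seq.count("G") / len(seq)


def g_skewness(seq: str) -> float:
    """(G - C) / (G + C), an assumed definition of G/C asymmetry.

    Returns 0.0 when the sequence contains neither G nor C.
    """
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


# Hypothetical example: four runs of three guanines separated by TTA loops.
example = "GGGTTAGGGTTAGGGTTAGGG"
print(f"G-richness: {g_richness(example):.3f}")
print(f"G-skewness: {g_skewness(example):.3f}")
```

Both values are bounded (richness in [0, 1], skewness in [-1, 1]), which makes them easy to compare across sequences of different lengths. A learned model such as the one proposed here works from the raw sequence instead of relying on hand-crafted summaries like these.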

Paper Structure (22 sections, 8 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Schematic representation of canonical G-quadruplex structures in DNA. (A) Structure of a planar G-quartet formed by four guanine bases through Hoogsteen-type hydrogen bonding, with a central cation, usually $\mathrm{K}^+$. G-quadruplex structures commonly comprise three planar G-quartets stacked on top of one another via $\pi$-$\pi$ stacking interactions. (B) Top view of the human promoter G4 found in the c-MYC gene (PDB: 1XAV).
  • Figure 2: Schematic diagram of the proposed model, G4-Attention. The model has three blocks: a CNN block, a Bi-LSTM block, and an Attention Fusion block.
  • Figure 3: G4-Attention outperforms all existing techniques across all negative types for (a) $K^{+}$ and (b) $K^{+}$ + PDS on held-out chromosome 1 of the G4-seqB dataset.
  • Figure 4: Performance of G4-Attention, measured by both AUROC and AUPRC, on test chromosomes 1, 3, 5, 7, and 9 of the G4-seqIB dataset.
  • Figure 5: Comparison of G4-Attention and G4Detector on novel datasets from three species: mouse, zebrafish, and Drosophila. The evaluation reports AUC scores on (a) $K^{+}$ datasets and (b) $K^{+}$ + PDS datasets. AUC scores for G4-Attention and G4Detector are indicated at the top of the bars.
  • ...and 1 more figure