GCBLANE: A graph-enhanced convolutional BiLSTM attention network for improved transcription factor binding site prediction

Jonas Chris Ferrao, Dickson Dias, Sweta Morajkar, Manisha Gokuldas Fal Dessai

TL;DR

TFBS identification remains challenging because of the volume of genomic data and the complexity of binding patterns. The authors introduce GCBLANE, a graph-enhanced convolutional BiLSTM attention network that combines CNN, multi-head attention, BiLSTM, and graph neural network components to capture both sequence motifs and graph-structured context from DNA. GCBLANE achieves an average ROC-AUC of $0.943$ across 690 ENCODE ChIP-seq datasets and $0.9495$ on a subset of 165 datasets, outperforming state-of-the-art sequence-based and multimodal predictors, and it generalizes well across cell lines and TFs. The two-stage transfer learning framework and de Bruijn-based graph representation contribute to improved efficiency and predictive accuracy, and incorporating DNA shape information may yield further gains. Overall, GCBLANE advances TFBS prediction by integrating sequential and graph-based features in a sequence-only framework, offering scalable performance improvements for large-scale genomic analyses.

Abstract

Identifying transcription factor binding sites (TFBS) is crucial for understanding gene regulation, as these sites enable transcription factors (TFs) to bind to DNA and modulate gene expression. Despite advances in high-throughput sequencing, accurately identifying TFBS remains challenging due to the vast genomic data and complex binding patterns. GCBLANE, a graph-enhanced convolutional bidirectional Long Short-Term Memory (LSTM) attention network, is introduced to address this issue. It integrates convolutional, multi-head attention, and recurrent layers with a graph neural network to detect key features for TFBS prediction. On 690 ENCODE ChIP-Seq datasets, GCBLANE achieved an average AUC of 0.943, and on 165 ENCODE datasets, it reached an AUC of 0.9495, outperforming advanced models that utilize multimodal approaches, including DNA shape information. This result underscores GCBLANE's effectiveness compared to other methods. By combining graph-based learning with sequence analysis, GCBLANE significantly advances TFBS prediction.
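To make the two input views concrete, the following is a minimal sketch of one-hot encoding and a de Bruijn-style graph built from a DNA sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names `one_hot` and `de_bruijn_edges`, the choice `k=3`, the use of (k-1)-mers as nodes, and the all-zero row for unknown bases are all hypothetical choices, since this section does not specify them.

```python
from collections import Counter

BASES = "ACGT"  # column order for the one-hot rows (an assumption)


def one_hot(seq):
    """Return one 4-element row per base, in A, C, G, T column order."""
    rows = []
    for base in seq.upper():
        # Bases outside ACGT (for example N) become an all-zero row.
        # This handling is an assumption, not taken from the paper.
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


def de_bruijn_edges(seq, k=3):
    """Count directed edges between overlapping (k-1)-mers.

    Every k-mer in the sequence contributes one edge from its
    (k-1)-length prefix to its (k-1)-length suffix.
    """
    seq = seq.upper()
    edges = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges


if __name__ == "__main__":
    example = "ACGTACG"  # toy sequence, not from any ENCODE dataset
    print(one_hot(example)[:2])
    for (src, dst), count in sorted(de_bruijn_edges(example).items()):
        print(f"{src} -> {dst} (count {count})")
```

In a pipeline of the kind the abstract describes, the one-hot matrix would feed the convolutional and recurrent layers, while the weighted edge list would be converted into node features and an adjacency structure for the graph neural network.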

Paper Structure

This paper contains 21 sections, 12 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: One-Hot Encoding
  • Figure 2: De Bruijn Representation of a short DNA sequence
  • Figure 3: Model Architecture of GCBLANE
  • Figure 4: Visualisation of GCBLANE performance with classification metrics.
  • Figure 5: ROC Curve with AUC score.
  • ...and 8 more figures