Predicting ATP binding sites in protein sequences using Deep Learning and Natural Language Processing

Shreyas V; Swati Agarwal

TL;DR

The study predicts ATP-binding residues directly from protein sequences, using sequence-based NLP features to sidestep the cost of wet-lab experiments. It proposes a multi-branch deep learning framework that fuses PSSM, PSIPRED, and several sequence embeddings (FastText, MP3Vec, BERT) within a SMOTE-balanced, LightGBM-enabled pipeline to classify residues. Across three benchmark datasets, the approach achieves competitive or superior performance using a window size of $W=17$, and the analysis reports notable leucine enrichment at binding sites. The work offers scalable, efficient residue-level annotations that can aid protein function annotation and drug design, and outlines concrete extensions to incorporate more structure-aware features and ensemble strategies.
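The SMOTE balancing step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote_oversample`, the neighbour count `k`, and the toy feature vectors are all assumptions made for illustration. It shows only the core idea of SMOTE, which is to synthesize new minority-class (binding-residue) samples by interpolating between an existing minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples (hypothetical helper).

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for binding-residue features
# (made-up numbers, not taken from the paper's datasets).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=6, k=2)
```

In practice, oversampling of this kind would normally be applied to the training split only, so that evaluation data keeps the natural class imbalance.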

Abstract

Predicting ATP-protein binding sites is of great significance in biology and medicine. Most research in this field has relied on time- and resource-intensive 'wet experiments' in laboratories. Over the years, researchers have investigated computational methods to accomplish the same goals, drawing on advanced deep learning and NLP algorithms. In this paper, we propose methods to classify ATP-protein binding sites. We conducted various experiments, mainly using PSSMs and several word embeddings as features. Our chief classifiers were 2D CNNs and LightGBM. We also tested the MP3Vec and BERT models. The outcomes of our experiments demonstrated improvement over the state-of-the-art benchmarks.
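To make the pairing of per-residue features with a 2D CNN more concrete, here is a hedged sketch of sliding-window feature extraction. It assumes each residue is described by one row of a profile matrix and that a window of $W=17$ rows centred on the residue forms the 2D input. A one-hot encoding stands in for a real PSSM, and the zero-padding at the sequence ends, the helper names `one_hot_profile` and `sliding_windows`, and the example sequence are all illustrative assumptions rather than details from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_profile(sequence):
    """Stand-in for a PSSM: one 20-dimensional row per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if aa in index:  # unknown residues (e.g. 'X') stay all-zero
            row[index[aa]] = 1.0
        rows.append(row)
    return rows


def sliding_windows(profile, window=17):
    """Return one (window x n_features) matrix per residue.

    The window is centred on the residue; positions that fall outside
    the sequence are filled with zero rows (an assumed padding scheme).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a centre residue")
    half = window // 2
    width = len(profile[0]) if profile else 0
    pad = [[0.0] * width for _ in range(half)]
    padded = pad + profile + pad
    return [padded[i:i + window] for i in range(len(profile))]


# Hypothetical five-residue sequence, used only to show the shapes.
profile = one_hot_profile("MKLAT")
windows = sliding_windows(profile, window=17)
```

Each element of `windows` is a 17 x 20 matrix whose middle row describes the residue being classified, so a binary label per residue lines up with one matrix per residue.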

Paper Structure (25 sections, 2 equations, 10 figures, 5 tables)

Introduction
Experimental Datasets
PATP-388 and PATP-41
ATP-227 and ATP-17
ATP-168
Proposed Methodology
Feature Engineering
Position Specific Scoring Matrix (PSSM)
FastText vectors
Predicted Secondary Structure
Addressing Data Imbalance
SMOTE algorithm:
LightGBM:
BERT:
MP3Vec:
...and 10 more sections

Figures (10)

Figure 1: A Sample Snapshot of a Protein Sequence Present in our Dataset.
Figure 2: A Snapshot of Binary-Encoded Labels Denoting the Presence and Absence of Protein Binding Sites.
Figure 3: Proposed Model Architecture
Figure 4: Effect of window size on AUC
Figure 5: Threshold vs. MCC for PATP-388, ATP-227 & ATP-168
...and 5 more figures
