Predicting ATP binding sites in protein sequences using Deep Learning and Natural Language Processing
Shreyas V, Swati Agarwal
TL;DR
The study tackles the challenge of predicting ATP-binding residues from protein sequences, addressing the limitations of wet-lab experiments by leveraging sequence-based NLP features. It proposes a multi-branch deep learning framework that fuses PSSM, PSIPRED, and multiple sequence embeddings (FastText, MP3Vec, BERT) within a SMOTE-balanced, LightGBM-enabled pipeline to classify residues. Across three benchmark datasets, the approach achieves competitive or superior performance, with window size $W=17$ and notable leucine enrichment at binding sites. The work offers scalable, efficient residue-level annotations that can aid protein function annotation and drug design, and outlines concrete extensions to incorporate more structure-aware features and ensemble strategies.
Abstract
Predicting ATP-protein binding sites in genes is of great significance in biology and medicine. Most research in this field has relied on time- and resource-intensive 'wet experiments' in laboratories. Over the years, researchers have investigated computational methods to accomplish the same goals, utilising the strength of advanced deep learning and NLP algorithms. In this paper, we propose methods to classify ATP-protein binding sites. We conducted various experiments, mainly using PSSMs and several word embeddings as features. We used 2D CNNs and LightGBM classifiers as our chief deep learning algorithms. We also tested the MP3Vec and BERT models. The outcomes of our experiments demonstrated improvement over the state-of-the-art benchmarks.
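To make the residue-level setup concrete, the sketch below shows one common way a per-residue PSSM can be cut into fixed-size windows suitable for a 2D CNN, using the window size $W=17$ noted in the TL;DR. It is only an illustration under stated assumptions, not the authors' pipeline: the function name `sliding_windows`, the zero-padding at the sequence termini, the 20-column profile, and the random toy matrix are all hypothetical choices made for this example.

```python
import numpy as np


def sliding_windows(pssm: np.ndarray, window: int = 17) -> np.ndarray:
    """Return one (window, n_columns) matrix per residue.

    Assumption: sequence termini are zero-padded so that every residue,
    including the first and last, sits at the centre of its own window.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    # Pad only along the sequence axis, not along the profile columns.
    padded = np.pad(pssm, ((half, half), (0, 0)), mode="constant")
    return np.stack([padded[i:i + window] for i in range(pssm.shape[0])])


# Toy stand-in for a real PSSM: 30 residues x 20 amino-acid columns.
rng = np.random.default_rng(0)
toy_pssm = rng.normal(size=(30, 20))

windows = sliding_windows(toy_pssm)
print(windows.shape)  # (30, 17, 20)
```

Each window can then be treated as a small single-channel image and labelled with the binding status of its central residue, which is what turns binding-site prediction into a per-residue classification problem.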
