XAI-Driven Deep Learning for Protein Sequence Functional Group Classification
Pratik Chakraborty, Aryan Bhargava
TL;DR
This paper addresses functional group classification of protein sequences and the interpretability of the resulting models. It compares four deep learning architectures (CNN, BiLSTM, CNN-BiLSTM, and CNN-Attention) trained on overlapping five-mers, and applies Grad-CAM and Integrated Gradients to obtain motif-level explanations. The CNN achieves the highest validation accuracy, 91.80%. Across architectures, the highlighted motifs are enriched in histidine, aspartate/glutamate, and lysine, residues that correspond to catalytic and metal-binding sites in transferases. The work shows that deep learning can deliver high predictive performance while providing interpretable biochemical insights that connect sequence patterns to function.
Abstract
Proteins perform essential biological functions, and accurate classification of their sequences is critical for understanding structure-function relationships, enzyme mechanisms, and molecular interactions. This study presents a deep learning framework for functional group classification of protein sequences derived from the Protein Data Bank (PDB). Four architectures were implemented: a Convolutional Neural Network (CNN), a Bidirectional Long Short-Term Memory network (BiLSTM), a CNN-BiLSTM hybrid, and a CNN with Attention. Each model was trained on integer-encoded overlapping k-mers (k = 5) to capture both local and long-range dependencies. Among these, the CNN achieved the highest validation accuracy, 91.8%, demonstrating the effectiveness of localized motif detection. Explainable AI techniques, namely Grad-CAM and Integrated Gradients, were applied to interpret model predictions and identify biologically meaningful sequence motifs. The discovered motifs are enriched in histidine, aspartate, glutamate, and lysine, amino acid residues commonly found in catalytic and metal-binding regions of transferase enzymes. These findings indicate that deep learning models can uncover functionally relevant biochemical signatures, bridging the gap between predictive accuracy and biological interpretability in protein sequence analysis.
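To make the k-mer integer encoding concrete, the following is a minimal Python sketch of one way overlapping five-mers could be turned into integer sequences for an embedding layer. It is an illustration under stated assumptions, not the authors' implementation: the function names (kmers, build_vocab, encode), the reserved padding and unknown indices, and the fixed-length padding scheme are all hypothetical choices that the paper does not specify.

```python
def kmers(sequence, k=5):
    """Return all overlapping k-mers (stride 1) of a protein sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(sequences, k=5):
    """Assign an integer ID to every k-mer seen in the given sequences.

    Assumption: ID 0 is reserved for padding and ID 1 for unseen k-mers.
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for km in kmers(seq, k):
            vocab.setdefault(km, len(vocab))
    return vocab


def encode(sequence, vocab, k=5, max_len=None):
    """Encode a sequence as k-mer IDs, optionally truncated/padded to max_len."""
    ids = [vocab.get(km, vocab["<unk>"]) for km in kmers(sequence, k)]
    if max_len is not None:
        ids = ids[:max_len] + [vocab["<pad>"]] * max(0, max_len - len(ids))
    return ids


# Toy example: a 6-residue sequence yields two overlapping five-mers.
toy_vocab = build_vocab(["MKHDEK"])
toy_ids = encode("MKHDEK", toy_vocab, max_len=4)
```

Because consecutive five-mers share four residues, a convolutional filter over the resulting ID sequence sees highly redundant local context, which is consistent with the abstract's emphasis on localized motif detection.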
