LAMP-PRo: Label-aware Attention for Multi-label Prediction of DNA- and RNA-binding Proteins using Protein Language Models

Nimisha Ghosh, Dheeran Sankaran, Rahul Balakrishnan Adhi, Sharath S, Amrut Anand

TL;DR

LAMP-PRo tackles the cross-prediction challenge between DNA- and RNA-binding proteins and extends to dual DRBP recognition by combining ESM-2 protein embeddings with CNN, MHSA, label-aware attention, and cross-label attention. The model learns label-specific sequence representations and explicitly models DBP–RBP dependencies to infer DRBP, aided by gated residuals and an invalid-label penalty to stabilize training. Across multiple datasets, LAMP-PRo achieves state-of-the-art or competitive AUC and 1-AURC metrics for DBP, RBP, and DRBP predictions, and demonstrates strong generalization to unseen proteins, supported by visual analyses of attention weights and biological relevance. The work provides interpretable predictions and a publicly available codebase, facilitating further exploration of label-aware multi-label approaches in protein-nucleic acid binding classification.

Abstract

Identifying DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) is crucial for understanding cell function, molecular interactions and regulatory functions. Owing to their high similarity, most existing approaches struggle to differentiate between DBPs and RBPs, leading to high cross-prediction errors. Moreover, identifying proteins that bind to both DNA and RNA (DRBPs) is also a challenging task. To mitigate these issues, we propose a novel framework, LAMP-PRo, which is based on a pre-trained protein language model (PLM), attention mechanisms and multi-label learning. First, a pre-trained PLM such as ESM-2 is used to embed the protein sequences, followed by a convolutional neural network (CNN). Subsequently, a multi-head self-attention (MHSA) mechanism is applied to capture contextual information, while label-aware attention computes class-specific representations by attending to the sequence in a way that is tailored to each label (DBP, RBP and non-NABP) in a multi-label setup. We also include a novel cross-label attention mechanism that explicitly captures dependencies between DNA- and RNA-binding proteins, enabling more accurate prediction of DRBPs. Finally, a linear layer followed by a sigmoid function is used for the final prediction. Extensive experiments comparing LAMP-PRo with existing methods show that the proposed model delivers consistently competitive performance. Furthermore, we provide visualizations to showcase model interpretability, highlighting which parts of the sequence are most relevant for a predicted label. The original datasets are available at http://bliulab.net/iDRBP_MMC and the code is available at https://github.com/NimishaGhosh/LAMP-PRo.
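The label-aware attention step described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the scaled dot-product scoring, the one-query-vector-per-label design, the random stand-in features, the 0.5 threshold and the decoding rule are all assumptions made for illustration, and every function name below is hypothetical. The sketch only shows the general idea of pooling one sequence into a separate representation per label and scoring each label independently with a sigmoid.

```python
import math
import random

LABELS = ("DBP", "RBP", "non-NABP")


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def label_aware_attention(features, label_queries):
    """Pool per-residue features into one vector per label.

    features: list of L vectors (stand-ins for the CNN/MHSA outputs).
    label_queries: one query vector per label (assumed to be learned).
    Returns (pooled, weights), where weights[k] sums to 1 over residues.
    """
    dim = len(features[0])
    pooled, weights = [], []
    for query in label_queries:
        scores = [dot(query, h) / math.sqrt(dim) for h in features]
        alpha = softmax(scores)
        rep = [sum(a * h[j] for a, h in zip(alpha, features)) for j in range(dim)]
        pooled.append(rep)
        weights.append(alpha)
    return pooled, weights


def predict(pooled, out_weights, out_bias):
    # One linear unit plus sigmoid per label: labels are scored independently.
    return [sigmoid(dot(w, rep) + b)
            for rep, w, b in zip(pooled, out_weights, out_bias)]


def decode(probs, threshold=0.5):
    # Assumed rule: DBP and RBP both above threshold is read as DRBP.
    p = dict(zip(LABELS, probs))
    is_dbp = p["DBP"] >= threshold
    is_rbp = p["RBP"] >= threshold
    if is_dbp and is_rbp:
        return "DRBP"
    if is_dbp:
        return "DBP"
    if is_rbp:
        return "RBP"
    return "non-NABP"


# Toy run with random numbers in place of real embeddings and weights.
rng = random.Random(0)
seq_len, dim = 12, 8
features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in LABELS]
out_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in LABELS]
out_b = [0.0 for _ in LABELS]

pooled, weights = label_aware_attention(features, queries)
probs = predict(pooled, out_w, out_b)
print(dict(zip(LABELS, probs)), decode(probs))
```

Because each label has its own attention weights over residues, the `weights` list is also what one would plot to see which positions contributed to a given label, which is the kind of interpretability the abstract refers to.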

Paper Structure

This paper contains 24 sections, 13 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Pipeline of LAMP-PRo. First, ESM-2 extracts embeddings from the input protein sequences. Next, a CNN extracts local features, while MHSA captures global features. Label-aware attention then learns a separate attention distribution for each label (DBP, RBP and non-NABP), and cross-label attention further models dependencies between DBP and RBP. Finally, a linear layer followed by a sigmoid function produces multi-label probabilities.
  • Figure 2: Cross-prediction performance of the different variants on the TEST474 dataset.
  • Figure 3: Comparison of different methods on the EZL dataset.
  • Figure 4: Visual analysis of correctly predicted sequences: (a) DBP and (b) RBP attention weights for a DBP sequence; (c) RBP and (d) DBP attention weights for an RBP sequence.