BAPULM: Binding Affinity Prediction using Language Models

Radheesh Sharma Meda, Amir Barati Farimani

TL;DR

BAPULM is a sequence-based framework that predicts binding affinity from the chemical latent representations of proteins (via ProtT5-XL-U50) and ligands (via MolFormer). It eliminates reliance on complex 3D configurations and offers a scalable alternative to 3D-centric methods for screening potential ligands.

Abstract

Identifying drug-target interactions is essential for developing effective therapeutics. Binding affinity quantifies these interactions, and traditional approaches rely on computationally intensive 3D structural data. In contrast, language models can efficiently process sequential data, offering an alternative approach to molecular representation. In the current study, we introduce BAPULM, an innovative sequence-based framework that leverages the chemical latent representations of proteins via ProtT5-XL-U50 and ligands through MolFormer, eliminating reliance on complex 3D configurations. Our approach was validated extensively on benchmark datasets, achieving scoring power (R) values of 0.925 $\pm$ 0.043, 0.914 $\pm$ 0.004, and 0.8132 $\pm$ 0.001 on benchmark1k2101, Test2016_290, and CSAR-HiQ_36, respectively. These findings indicate the robustness and accuracy of BAPULM across diverse datasets and underscore the potential of sequence-based models in in-silico drug discovery, offering a scalable alternative to 3D-centric methods for screening potential ligands.
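Scoring power (R) in binding-affinity benchmarks is conventionally the Pearson correlation coefficient between predicted and experimental affinities. The following is a minimal sketch of that computation, not the authors' evaluation code; the function name `pearson_r` and the sample $\text{pK}_{\text{d}}$ values are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(predicted, experimental):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(predicted) != len(experimental) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    # Sums of products of deviations from the means.
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_e)


# Hypothetical pKd values, for illustration only (not taken from the paper).
experimental_pkd = [4.2, 5.1, 6.3, 7.8, 9.0]
predicted_pkd = [4.5, 5.0, 6.0, 7.5, 8.6]
r = pearson_r(predicted_pkd, experimental_pkd)
print(f"scoring power R = {r:.3f}")
```

A value reported as mean $\pm$ deviation, as in the abstract, would come from repeating this computation over several runs; how those runs were set up is not stated in this summary.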

Paper Structure

This paper contains 12 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the BAPULM framework. The feature extraction module encodes protein sequences with ProtT5-XL-U50 and ligand SMILES with MolFormer. These embeddings are aligned through projection layers and fed into a feed-forward predictive network to predict binding affinity.
  • Figure 2: Distribution of (a) protein sequence lengths, which range from 13 to 7073 amino acids and are skewed, with most sequences under 1000 amino acids, and (b) ligand SMILES string lengths, which range from 4 to 547 characters and are also skewed, with most strings shorter than 100 characters.
  • Figure 3: Evaluation of BAPULM on multiple datasets. The scatter plots depict the correlation between predicted and experimental $\text{pK}_{\text{d}}$ values for the (a) training, (b) validation, (c) Benchmark1k2101, (d) Test2016_290, and (e) CSAR-HiQ_36 datasets.
  • Figure 4: Embedding visualizations of protein-ligand binding affinity mapped onto features extracted from (a) BAPULM, (b) ProtBert & MolFormer, and (c) ProtBert & ChemBerta, illustrating the latent space representations of each configuration on the training dataset.
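
The data flow described in the Figure 1 caption (precomputed embeddings, projection layers, then a feed-forward predictor) can be sketched with NumPy as below. This is an illustrative forward pass with random weights, not the authors' architecture: the embedding sizes (1024 for ProtT5-XL-U50, 768 for MolFormer), the shared and hidden widths, the ReLU activations, the concatenation step, and every name in the code are assumptions.

```python
import numpy as np

PROTEIN_DIM = 1024  # assumed ProtT5-XL-U50 embedding size
LIGAND_DIM = 768    # assumed MolFormer embedding size
SHARED_DIM = 512    # hypothetical projection width
HIDDEN_DIM = 256    # hypothetical hidden width of the predictor

rng = np.random.default_rng(0)


def init_linear(in_dim, out_dim):
    """Random weights and zero bias for one linear layer (untrained)."""
    weight = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
    return weight, np.zeros(out_dim)


def linear(x, layer):
    weight, bias = layer
    return x @ weight + bias


def relu(x):
    return np.maximum(x, 0.0)


params = {
    "protein_proj": init_linear(PROTEIN_DIM, SHARED_DIM),
    "ligand_proj": init_linear(LIGAND_DIM, SHARED_DIM),
    "hidden": init_linear(2 * SHARED_DIM, HIDDEN_DIM),
    "out": init_linear(HIDDEN_DIM, 1),
}


def predict_affinity(protein_emb, ligand_emb):
    """Map (batch, dim) embeddings to one affinity score per pair."""
    # Projection layers bring both modalities to the same width.
    p = relu(linear(protein_emb, params["protein_proj"]))
    l = relu(linear(ligand_emb, params["ligand_proj"]))
    # Assumption: the aligned embeddings are concatenated before prediction.
    joint = np.concatenate([p, l], axis=-1)
    hidden = relu(linear(joint, params["hidden"]))
    return linear(hidden, params["out"]).squeeze(-1)


# Random stand-ins for encoder outputs; real inputs would come from the
# pretrained protein and ligand language models.
protein_emb = rng.normal(size=(4, PROTEIN_DIM))
ligand_emb = rng.normal(size=(4, LIGAND_DIM))
scores = predict_affinity(protein_emb, ligand_emb)
print(scores.shape)
```

Because the weights are random, the scores carry no meaning; the sketch only shows how two embeddings of different sizes can be aligned and combined into a single scalar prediction per protein-ligand pair.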