Natural Language Processing Methods for the Study of Protein-Ligand Interactions

James Michels; Ramya Bandarupalli; Amin Ahangar Akbari; Thai Le; Hong Xiao; Jing Li; Erik F. Y. Hom

TL;DR

This review examines how NLP techniques have been adapted to decode the "language" of proteins and small-molecule ligands in order to predict protein-ligand interactions (PLIs). It argues that improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based prediction of PLIs.

Abstract

Recent advances in Natural Language Processing (NLP) have ignited interest in developing effective methods for predicting protein-ligand interactions (PLIs) given their relevance to drug discovery and protein engineering efforts and the ever-growing volume of biochemical sequence and structural data available. The parallels between human languages and the "languages" used to represent proteins and ligands have enabled the use of NLP machine learning approaches to advance PLI studies. In this review, we explain where and how such approaches have been applied in the recent literature and discuss useful mechanisms such as long short-term memory, transformers, and attention. We conclude with a discussion of the current limitations of NLP methods for the study of PLIs as well as key challenges that need to be addressed in future work.

Paper Structure (21 sections, 4 figures, 6 tables)

1. Introduction
1.1. Overview of Natural Language Processing (NLP)
2. The "Language" of Proteins
3. The "Language" of Ligands
4. Protein-Ligand Interaction Data and Datasets
5. Machine Learning and NLP for PLIs
5.1. The Extract-Fuse-Predict Framework
5.2. Extraction of Embeddings
5.2.1. Recurrent Neural Networks
5.2.2. Attention-Based Architectures
5.2.3. Transformers
5.3. Fusion of Protein-Ligand Representations: Concatenation or Cross-Attention
5.4. Prediction of Target Variables
5.5. Evaluation
6. Challenges and Future Directions
...and 6 more sections

Figures (4)

Figure 1: The Language of Protein Sequences and SMILES: NLP methods can be applied to text representations to infer local and global properties of human language, proteins, and molecules alike. Local properties are inferred characteristics of sub-sequences in text: (i) for a human language, this can include part of speech or a role a specific word serves; (ii) for a protein sequence, this can include secondary structures, post-translational modifications, and functional sites; (iii) for a SMILES string, this can include functional groups and characters used within SMILES syntax to indicate chemical attributes. Global properties are inferred from a text in its entirety: (i) for a human language, this can include information such as authorship, tone, and synopses; (ii) for a protein sequence, this can include the protein's structure, stability, and dynamic properties; and (iii) for a SMILES string, this can include the ligand's 2D molecular structure and other biochemical properties.
Figure 2: Summary of the Data Preparation, Model Creation, and Model Evaluation Workflow. Model Creation for PLI studies follows an Extract-Fuse-Predict Framework: input protein and ligand data are extracted and embedded, combined, and passed into a machine learning model to generate predictions.
Figure 3: Sample Attention Weights for Relating Protein and Ligand. The heatmaps on the left help visualize the weighted importance of select protein residues and ligand atoms in a PLI. Structural views of the protein-ligand binding pocket are shown in the middle, with insets of the 2D ligand structures on the right. The colored residues and red highlights indicate amino acids (AAs) in the protein binding pocket and ligand atoms with high attention scores. Adapted and modified from Figure 7 of Wu et al. (Wu2024-ma), used with permission under a CC BY 4.0 license.
Figure: Table of Contents Graphic
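
The Extract-Fuse-Predict framework named in the Figure 2 caption can be illustrated with a deliberately tiny sketch. Everything below is an assumption made for illustration and is not taken from the paper: the character-count "embedding" stands in for a learned encoder such as an LSTM or transformer, the SMILES vocabulary is a small hypothetical subset, fusion is plain concatenation, and the prediction weights are fixed rather than trained.

```python
from collections import Counter

# Hypothetical vocabularies, chosen only for this sketch.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")   # 20 standard amino acids
SMILES_VOCAB = list("CNOcno=#()123")         # small subset of SMILES characters


def extract(text, vocab):
    """Extract: embed a string as normalized character counts over vocab.

    A stand-in for a learned sequence encoder; characters outside
    the vocabulary are ignored.
    """
    counts = Counter(text)
    total = max(len(text), 1)
    return [counts[ch] / total for ch in vocab]


def fuse(protein_vec, ligand_vec):
    """Fuse: combine the two embeddings by concatenation."""
    return protein_vec + ligand_vec


def predict(fused, weights, bias=0.0):
    """Predict: linear score over the fused vector.

    In practice the weights would be learned from interaction data.
    """
    return sum(w * x for w, x in zip(weights, fused)) + bias


protein = "MKTAYIAKQR"   # toy protein sequence (one-letter amino acid codes)
ligand = "CC(=O)O"       # SMILES string for acetic acid

protein_vec = extract(protein, AMINO_ACIDS)
ligand_vec = extract(ligand, SMILES_VOCAB)
fused = fuse(protein_vec, ligand_vec)

weights = [0.1] * len(fused)   # placeholder weights, not trained
score = predict(fused, weights)
```

Concatenation is used here only because it is the simplest fusion choice; cross-attention, the alternative listed in Section 5.3, would instead let each representation condition on the other before prediction.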
