
Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models

Ben Fauber

TL;DR

The study tackles the prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), by fine-tuning pretrained generative small language models (SLMs) with instruction-based prompts. It builds large LPI data sets (LPI-1.5M and LPI-3.5M) from PubChem, BindingDB, and Davis, and frames affinities as ordinal classes A–E, enabling zero-shot evaluation using only ligand SMILES strings and UniProt amino acid sequences. The instruction-fine-tuned SLMs reach up to 44% exact matches with 3.5M training examples and up to 97% near matches, outperforming traditional ML and remaining competitive with FEP+-based methods. The approach offers a simple, scalable, high-throughput way to rank ligand candidates and could accelerate drug discovery campaigns, especially as the amount of training data increases.
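
To make the exact-match and near-match percentages quoted above concrete, here is a minimal sketch of how such rates could be computed over the ordinal classes A–E. The function name `match_rates` and the `tolerance` parameter are hypothetical, and the definition of a near match as a prediction within one ordinal class of the true label (exact matches included) is an assumption; the summary does not state the paper's definition.

```python
CLASSES = "ABCDE"  # ordinal affinity classes, strongest (A) to weakest (E)


def match_rates(y_true, y_pred, tolerance=1):
    """Return (exact_rate, near_rate) for ordinal class labels.

    ASSUMPTION: a "near match" is a prediction at most `tolerance`
    classes away from the true class, exact matches included.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    index = {label: i for i, label in enumerate(CLASSES)}
    # Ordinal distance between true and predicted class for every example.
    distances = [abs(index[t] - index[p]) for t, p in zip(y_true, y_pred)]
    exact_rate = sum(d == 0 for d in distances) / len(distances)
    near_rate = sum(d <= tolerance for d in distances) / len(distances)
    return exact_rate, near_rate


if __name__ == "__main__":
    truth = list("ABCDE")
    preds = list("ABDDA")
    print(match_rates(truth, preds))
```

Because the classes are ordinal, a prediction of B for a true A is a much smaller error than a prediction of E, which is why a tolerance-based rate is reported alongside the exact-match rate.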

Abstract

We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.

Paper Structure (28 sections, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Illustration of our proposed task: prediction of ordinal affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand (top left) and the amino acid sequence of the target protein (bottom left) are used as the model inputs.
  • Figure 2: Physics-based virtual screening to rank order ligand-protein interactions with free-energy perturbation calculations (FEP+). An example is shown with the DCAI ligand and human KRas4B-G12D protein [PDB: 4DST]. The SMILES string of the ligand must be converted into a low-energy 3D conformation (grey). Free-energy perturbation calculations require either a ligand bound in an X-ray cocrystal structure of the ligand (yellow) and protein (blue), or low-energy ligand conformations (grey) docked into a known X-ray structure of the target protein (blue) or into a 3D structure predicted from its amino acid sequence. Free-energy perturbation calculations often require multiple validated binders of known binding affinities to benchmark the method ($\Delta G_{exp}$) and rank order the FEP+ calculation outcomes of proposed ligands ($\Delta\Delta G_{FEP+}$) relative to the benchmark $\Delta G_{exp}$ values.
  • Figure 3: Examples of (a) ligand binding pocket and (b) allosteric ligand-protein interactions. (a) Cocrystal structure (1.99 Å) of a tertiary sulfonamide ligand (orange) in complex with human RORc-LBD (beige) [PDB: 4WQP]. (b) Cocrystal structure (2.39 Å) of the small molecule ligand DCAI (yellow) in complex with human KRas4B-G12D (light blue) [PDB: 4DST]. Both images depict the ligand binding pockets of the respective proteins as transparent surfaces (light grey), and protein side chains are omitted for clarity. Notably, (a) exemplifies a deep ligand binding pocket within the protein, whereas (b) illustrates an allosteric interaction of the DCAI ligand on the protein surface. Further, (b) clearly lacks any significant binding pocket interactions between the ligand and protein, and the DCAI ligand disrupts the protein-protein interaction between the KRas and SOS proteins (SOS protein not shown).
  • Figure 4: Sources of the BindingDB ligand-protein interaction data set as of April 2024. Raw count values are shown on the x-axis, and the corresponding percentages of the total count for each data source are noted as labels on each bar of the plot.
  • Figure 5: Ordinal affinity value distributions of the BindingDB-2M (orange), LPI-1.5M (blue), and LPI-3.5M (grey) data sets. The ligand-protein interaction ordinal affinity values shown on the x-axis are: A (pIC50$\ge 8$), B ($8 >$ pIC50$\ge 7$), C ($7 >$ pIC50$\ge 6$), D ($6 >$ pIC50$\ge 5$), and E ($5 >$ pIC50). Raw count values are shown on the y-axis, and the corresponding percentages of the total data set for each class are noted as labels on each bar of the plot.
  • ...and 5 more figures
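
The ordinal classes defined in the Figure 5 caption can be sketched as a simple binning function. The thresholds below are taken directly from that caption; the function names and the nanomolar-to-pIC50 helper are hypothetical illustrations (using the standard definition pIC50 = -log10 of the IC50 in mol/L), not the authors' preprocessing code.

```python
import math


def affinity_class(pic50: float) -> str:
    """Map a pIC50 value to the ordinal class A-E from the Figure 5 caption."""
    if pic50 >= 8:
        return "A"  # pIC50 >= 8
    if pic50 >= 7:
        return "B"  # 8 > pIC50 >= 7
    if pic50 >= 6:
        return "C"  # 7 > pIC50 >= 6
    if pic50 >= 5:
        return "D"  # 6 > pIC50 >= 5
    return "E"      # 5 > pIC50


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 (hypothetical helper).

    pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


if __name__ == "__main__":
    for ic50_nm in (5.0, 50.0, 500.0, 5_000.0, 50_000.0):
        pic50 = pic50_from_ic50_nm(ic50_nm)
        print(f"IC50 = {ic50_nm:>8.1f} nM  pIC50 = {pic50:.2f}  class {affinity_class(pic50)}")
```

Under this binning, each class spans one log unit of potency, so a one-class error corresponds to at most a tenfold difference in IC50 at the class boundary.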