Accurate Prediction of Ligand-Protein Interaction Affinities with Fine-Tuned Small Language Models
Ben Fauber
TL;DR
The study tackles the challenge of predicting ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), by fine-tuning pretrained generative small language models (SLMs) with instruction-based prompts. It builds large LPI datasets (LPI-1.5M and LPI-3.5M) from PubChem, BindingDB, and Davis, and frames affinities as ordinal classes A–E, enabling zero-shot evaluation using only ligand SMILES strings and UniProt protein sequences. The instruction-fine-tuned SLMs achieve substantial accuracy improvements (up to 44% exact matches with 3.5M training examples) and strong near-match performance (up to 97%), outperforming traditional ML methods and remaining competitive with FEP+-based methods. The approach offers a simple, scalable, high-throughput way to rank ligand candidates for drug discovery campaigns, and its accuracy improves as the training data grows.
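The exact-match and near-match figures quoted above can be illustrated with a small scoring routine over the ordinal classes A–E. This is a minimal sketch, not the author's evaluation code: the function names `class_rank` and `match_rates` are hypothetical, and treating a "near match" as a prediction within one class of the true label is an assumption, since the summary does not define the term.

```python
from typing import Optional, Sequence, Tuple

CLASSES = "ABCDE"  # ordinal affinity labels named in the text


def class_rank(label: str) -> Optional[int]:
    """Return the ordinal position of a label, or None if it is not A-E."""
    label = label.strip().upper()
    if len(label) == 1 and label in CLASSES:
        return CLASSES.index(label)
    return None


def match_rates(
    y_true: Sequence[str], y_pred: Sequence[str], tolerance: int = 1
) -> Tuple[float, float]:
    """Compute (exact-match rate, near-match rate).

    ASSUMPTION: a near match is a prediction at most `tolerance` classes
    away from the true class. The source does not state this definition.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    exact = near = 0
    for true_label, pred_label in zip(y_true, y_pred):
        true_rank = class_rank(true_label)
        pred_rank = class_rank(pred_label)
        if true_rank is None:
            raise ValueError(f"invalid ground-truth label: {true_label!r}")
        if pred_rank is None:
            continue  # unparseable generation counts as a miss for both rates
        distance = abs(true_rank - pred_rank)
        exact += distance == 0
        near += distance <= tolerance
    total = len(y_true)
    return exact / total, near / total


if __name__ == "__main__":
    exact_rate, near_rate = match_rates(list("ABCDE"), list("ABDEA"))
    print(f"exact={exact_rate:.2f} near={near_rate:.2f}")
```

Counting an unparseable generation as a miss is a design choice for this sketch: a generative model can emit text outside the label set, and dropping those rows would inflate both rates.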
Abstract
We describe the accurate prediction of ligand-protein interaction (LPI) affinities, also known as drug-target interactions (DTI), with instruction fine-tuned pretrained generative small language models (SLMs). We achieved accurate predictions for a range of affinity values associated with ligand-protein interactions on out-of-sample data in a zero-shot setting. Only the SMILES string of the ligand and the amino acid sequence of the protein were used as the model inputs. Our results demonstrate a clear improvement over machine learning (ML) and free-energy perturbation (FEP+) based methods in accurately predicting a range of ligand-protein interaction affinities, which can be leveraged to further accelerate drug discovery campaigns against challenging therapeutic targets.
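Because the only model inputs are the ligand SMILES string and the protein amino acid sequence, an instruction prompt for one ligand-protein pair might be assembled roughly as follows. This is a sketch under stated assumptions: the prompt wording, the field labels, and the helper name `build_prompt` are hypothetical and are not taken from the paper, and the protein fragment in the usage example is made up for illustration.

```python
def build_prompt(smiles: str, sequence: str) -> str:
    """Format one ligand-protein pair as an instruction prompt.

    ASSUMPTION: the template text below is illustrative only; the source
    does not give the actual prompt used for fine-tuning.
    """
    smiles = smiles.strip()
    # Remove whitespace and line breaks that FASTA-style sequences often carry.
    sequence = "".join(sequence.split()).upper()
    if not smiles or not sequence:
        raise ValueError("both a SMILES string and a protein sequence are required")
    return (
        "Instruction: Predict the binding affinity class (A, B, C, D, or E) "
        "for the ligand and protein below.\n"
        f"Ligand SMILES: {smiles}\n"
        f"Protein sequence: {sequence}\n"
        "Affinity class:"
    )


if __name__ == "__main__":
    aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"      # SMILES for aspirin
    toy_fragment = "MKTAYIAK QRQISFVK\nSHFSRQ"  # made-up sequence fragment
    print(build_prompt(aspirin, toy_fragment))
```

Ending the prompt at "Affinity class:" leaves a single label for the model to generate, which keeps the output easy to parse into one of the five classes.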
