Improving generalisability of 3D binding affinity models in low data regimes

Julia Buhmann; Ward Haddadin; Lukáš Pravda; Alan Bilsland; Hagen Triendl

TL;DR

This work addresses the generalisability of 3D binding affinity models in low-data regimes. It introduces a low-similarity split of PDBBind that minimizes train–test leakage and enables a fair comparison between global 3D models and local models. It benchmarks multiple model families, including 3D EGNNs, local ligand-based models, and baselines, and demonstrates that global 3D models generally outperform local ones when data are scarce. The study further shows that three additions to the GNNs substantially boost performance in low-data settings, with gains fading as data increases: explicitly modeling hydrogen atoms, quantum-mechanical energy pre-training, and diffusion pre-training. These findings offer practical guidance for developing more generalisable 3D binding affinity models and point to pre-training as a promising way to unlock the potential of graph neural networks in drug design.

Abstract

Predicting protein-ligand binding affinity is an essential part of computer-aided drug design. However, generalisable and performant global binding affinity models remain elusive, particularly in low data regimes. Despite the evolution of model architectures, current benchmarks are not well-suited to probe the generalisability of 3D binding affinity models. Furthermore, 3D global architectures such as GNNs have not lived up to performance expectations. To investigate these issues, we introduce a novel split of the PDBBind dataset, minimizing similarity leakage between train and test sets and allowing for a fair and direct comparison between various model architectures. On this low similarity split, we demonstrate that, in general, 3D global models are superior to protein-specific local models in low data regimes. We also demonstrate that the performance of GNNs benefits from three novel contributions: supervised pre-training via quantum mechanical data, unsupervised pre-training via small molecule diffusion, and explicitly modeling hydrogen atoms in the input graph. We believe that this work introduces promising new approaches to unlock the potential of GNN architectures for binding affinity modelling.
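To make the idea of "minimizing similarity leakage" concrete, the following is a minimal sketch of one generic way to build a low-similarity split. It is not the authors' procedure, which this summary does not describe. The sketch assumes each complex is represented by a set of "on" fingerprint bits, compares complexes by Tanimoto similarity, and drops any training candidate that is too similar to a test complex. The function names, the complex IDs, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def low_similarity_split(
    fingerprints: Dict[str, FrozenSet[int]],
    test_ids: List[str],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Return (train_ids, dropped_ids).

    A non-test complex is dropped from training when its highest
    similarity to any test complex reaches the threshold.
    """
    test_set = set(test_ids)
    test_fps = [fingerprints[i] for i in test_ids]
    train_ids: List[str] = []
    dropped_ids: List[str] = []
    for complex_id, fp in fingerprints.items():
        if complex_id in test_set:
            continue
        max_sim = max((tanimoto(fp, t) for t in test_fps), default=0.0)
        if max_sim >= threshold:
            dropped_ids.append(complex_id)  # too close to the test set
        else:
            train_ids.append(complex_id)
    return train_ids, dropped_ids


# Toy data: IDs and bit sets are made up for illustration.
fps = {
    "a": frozenset({1, 2, 3, 4}),
    "b": frozenset({1, 2, 3, 5}),
    "c": frozenset({7, 8, 9}),
    "t": frozenset({1, 2, 3, 4, 5}),
}
train_ids, dropped_ids = low_similarity_split(fps, test_ids=["t"])
print(train_ids, dropped_ids)
```

In this toy case, "a" and "b" each share four of five bits with the test complex "t" (similarity 0.8) and are dropped, while "c" shares none and stays in training. A real split would also have to decide whether similarity is measured on ligands, proteins, or both, which this sketch leaves open.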

Paper Structure (37 sections, 2 equations, 10 figures, 6 tables)

Introduction
Methods
PDBBind dataset
Structure preparation
Models
Model families
Single-Protein local models
EGNN models
Pre-Trained EGNNs
Hydrogens
Single-Graph vs. Multi-Graph
RF-Score and OnionNet-2
Baseline models
Ligand-Bias model
Molecular-Weight model
...and 22 more sections

Figures (10)

Figure 1: Schematic of Low-Sim benchmarking splits used in this study. Global models are trained on both train sets from the Case-Study-Proteins and the Other-Proteins. Local models are individually built for each of the eight proteins in the Case-Study-Proteins split. They require a minimum set of already available ligands for a specific protein for training, thus can only be created for the 5%, 30%, 80% splits and use only data from the Case-Study-Proteins split. Note that bars are not to scale with number of samples.
Figure 2: Overall and stratified performance at increasing train data fraction for different model families. In the low data regime, global 3D models outperform local models. Left: The error bars denote the standard deviation across the three test folds. Right: The boxplots represent the performance distribution over the eight proteins in the Case-Study-Proteins set.
Figure 3: Effect of number of training data points on performance. Each point represents a protein from the eight case-study proteins. The global models show a clear advantage at low data regimes.
Figure 4: Effect of the EGNN additions proposed in this study on model performance. The overall performance across all eight proteins in the Case-Study-Proteins set is reported. The error bars denote the standard deviation across the three test folds. In the 0% split case, there is only a single test fold; because training is non-deterministic, the variation shown there comes from training the same EGNN model three times. Pre-training: Quantum mechanical pre-training provides the greatest advantage, followed closely by diffusion pre-training. Hydrogens: Including explicit hydrogens is very important at low data levels. Interacting pose: No consistent pattern emerges when comparing single-graph versus multi-graph.
Figure 5: Overall and stratified performance at increasing train data fraction for different model families. In the low data regime, global 3D models outperform local models. Left: The error bars denote the standard deviation across the 3 test folds. Right: The boxplots represent the performance distribution over the eight proteins in the Case-Study-Proteins set.
...and 5 more figures
