Improving generalisability of 3D binding affinity models in low data regimes
Julia Buhmann, Ward Haddadin, Lukáš Pravda, Alan Bilsland, Hagen Triendl
TL;DR
This work addresses the generalisability of 3D binding affinity models, especially in low-data regimes, by introducing a low-similarity PDBBind split that minimizes train–test leakage and enables a fair comparison between global 3D models and local models. It benchmarks multiple model families, including 3D EGNNs, local ligand-based models, and strong baselines, and demonstrates that global 3D models outperform local ones when data are scarce. The study further shows that explicitly modelling hydrogen atoms and two novel pre-training strategies (quantum-mechanical energy pre-training and diffusion pre-training) substantially boost performance in low-data settings, with gains fading as the amount of training data increases. These findings offer practical guidance for developing more generalisable 3D binding affinity models and highlight promising pre-training approaches to unlock the potential of graph neural networks in drug design.
Abstract
Predicting protein-ligand binding affinity is an essential part of computer-aided drug design. However, generalisable and performant global binding affinity models remain elusive, particularly in low-data regimes. Despite the evolution of model architectures, current benchmarks are not well suited to probing the generalisability of 3D binding affinity models. Furthermore, 3D global architectures such as GNNs have not lived up to performance expectations. To investigate these issues, we introduce a novel split of the PDBBind dataset that minimizes similarity leakage between train and test sets and allows for a fair and direct comparison between various model architectures. On this low-similarity split, we demonstrate that, in general, 3D global models are superior to protein-specific local models in low-data regimes. We also demonstrate that the performance of GNNs benefits from three novel contributions: supervised pre-training on quantum mechanical data, unsupervised pre-training via small molecule diffusion, and explicit modelling of hydrogen atoms in the input graph. We believe that this work introduces promising new approaches to unlock the potential of GNN architectures for binding affinity modelling.
