What Ails Generative Structure-based Drug Design: Expressivity is Too Little or Too Much?
Rafał Karczewski, Samuel Kaski, Markus Heinonen, Vikas Garg
TL;DR
The paper investigates why generative models for structure-based drug design (SBDD) underperform, examining two candidate explanations: the limited expressivity of graph neural networks (GNNs) and overparameterization that buys expressivity at the expense of generalization. It formalizes expressivity limits of LU-GNNs for protein-ligand complexes and introduces SimpleSBDD, a two-phase, metric-aware framework that first infers an unlabelled molecular graph using an economical surrogate for binding affinity and then optimizes atom types conditioned on that graph and on molecular properties. In experiments on CrossDocked2020 and drug repurposing tasks, SimpleSBDD achieves state-of-the-art docking performance with up to 1000x faster runtimes and about 100x fewer parameters, suggesting that targeted optimization can matter more than raw expressivity. The findings call for reorienting SBDD toward robust generalization and computational efficiency, which could streamline docking pipelines, and for more reliable validation beyond docking scores.
Abstract
Several generative models with elaborate training and sampling procedures have been proposed to accelerate structure-based drug design (SBDD); however, their empirical performance turns out to be suboptimal. We seek to better understand this phenomenon from both theoretical and empirical perspectives. Since most of these models apply graph neural networks (GNNs), one may suspect that they inherit the representational limitations of GNNs. We analyze this aspect, establishing the first such results for protein-ligand complexes. A plausible counterview may attribute the underperformance of these models to their excessive parameterizations, inducing expressivity at the expense of generalization. We investigate this possibility with a simple metric-aware approach that learns an economical surrogate for affinity to infer an unlabelled molecular graph and optimizes for labels conditioned on this graph and molecular properties. The resulting model achieves state-of-the-art results using 100x fewer trainable parameters and affords up to 1000x speedup. Collectively, our findings underscore the need to reassess and redirect the existing paradigm and efforts for SBDD. Code is available at https://github.com/rafalkarczewski/SimpleSBDD.
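The two-phase idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that): the surrogate formula, the valence table, the property rewards, and all function names below are hypothetical stand-ins chosen only to show the decoupling of graph selection from label assignment.

```python
from itertools import product

# Hypothetical stand-in for a learned affinity surrogate: it sees only the
# unlabelled graph (node count and ring count), never the atom types.
def surrogate_affinity(n_nodes, edges):
    n_rings = len(edges) - n_nodes + 1  # cycle rank of a connected graph
    return 0.3 * n_nodes + 1.0 * n_rings

# Illustrative maximum valences and a made-up per-atom property reward.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}
PROPERTY_REWARD = {"C": 0.0, "N": 0.5, "O": 0.4}

def degrees(n_nodes, edges):
    deg = [0] * n_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def phase_one(candidates):
    """Phase 1: pick the unlabelled graph with the highest surrogate score."""
    return max(candidates, key=lambda g: surrogate_affinity(*g))

def phase_two(graph):
    """Phase 2: label the fixed graph, maximizing the property reward subject
    to valence limits (exhaustive search, adequate only for tiny graphs)."""
    n_nodes, edges = graph
    deg = degrees(n_nodes, edges)
    best, best_score = None, float("-inf")
    for labels in product(MAX_VALENCE, repeat=n_nodes):
        if any(deg[i] > MAX_VALENCE[a] for i, a in enumerate(labels)):
            continue  # this labelling violates a valence limit
        score = sum(PROPERTY_REWARD[a] for a in labels)
        if score > best_score:
            best, best_score = labels, score
    return best, best_score

# Candidate unlabelled graphs as (number of nodes, edge list).
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
candidates = [
    (4, [(0, 1), (1, 2), (2, 3)]),   # chain
    (5, ring),                       # five-membered ring
    (7, ring + [(0, 5), (0, 6)]),    # ring with two substituents on node 0
]

best_graph = phase_one(candidates)
best_labels, best_reward = phase_two(best_graph)
print(best_graph[0], best_labels, best_reward)
```

In this toy setup the graph is chosen without reference to atom types, and the labels are then searched over a much smaller, graph-conditioned space. A real system would replace the hand-written surrogate with a learned model and the exhaustive search with a scalable optimizer.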
