What Ails Generative Structure-based Drug Design: Expressivity is Too Little or Too Much?

Rafał Karczewski, Samuel Kaski, Markus Heinonen, Vikas Garg

TL;DR

The paper investigates why generative models for structure-based drug design underperform and argues that both GNN expressivity limits and overparameterization contribute to the gap. It formalizes expressivity limits for LU‑GNNs in protein–ligand contexts and introduces SimpleSBDD, a two‑phase, performance‑aware framework that decouples the unlabelled molecular graph from atom types and optimizes for binding affinity via an economical surrogate. Through extensive experiments on CrossDocked2020 and drug repurposing tasks, SimpleSBDD achieves state‑of‑the‑art docking performance with up to $1000\times$ faster runtimes and about $100\times$ fewer parameters, demonstrating the practicality of targeted optimization over raw expressivity. The findings advocate rethinking SBDD toward robust generalization and computational efficiency, with potential to streamline docking pipelines while highlighting the need for more reliable validation beyond docking scores.

Abstract

Several generative models with elaborate training and sampling procedures have been proposed to accelerate structure-based drug design (SBDD); however, their empirical performance turns out to be suboptimal. We seek to better understand this phenomenon from both theoretical and empirical perspectives. Since most of these models apply graph neural networks (GNNs), one may suspect that they inherit the representational limitations of GNNs. We analyze this aspect, establishing the first such results for protein-ligand complexes. A plausible counterview may attribute the underperformance of these models to their excessive parameterizations, inducing expressivity at the expense of generalization. We investigate this possibility with a simple metric-aware approach that learns an economical surrogate for affinity to infer an unlabelled molecular graph and optimizes for labels conditioned on this graph and molecular properties. The resulting model achieves state-of-the-art results using 100x fewer trainable parameters and affords up to 1000x speedup. Collectively, our findings underscore the need to reassess and redirect the existing paradigm and efforts for SBDD. Code is available at https://github.com/rafalkarczewski/SimpleSBDD.

Paper Structure (51 sections, 4 theorems, 42 equations, 11 figures, 7 tables)

Key Result

Lemma 3.0

There exist connected non-isomorphic geometric graphs that differ in the number of conjoined cycles, girth, size of the largest cycle and cut-edges that LU3D-GNNs cannot distinguish.
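
The flavour of this result can be illustrated with a small, purely topological sketch. It is not the paper's geometric construction: it assumes plain 1-WL colour refinement as a stand-in for a message-passing GNN with uniform initial features, and it uses the textbook decalin/bicyclopentyl pair, which may differ from the graphs in Figure 1. All function names are hypothetical helpers written for this example.

```python
from collections import Counter, deque

def cycle_edges(nodes):
    """Edges of a simple cycle visiting `nodes` in order."""
    return [(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))]

def adjacency(edges, n):
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def wl_histogram(adj, rounds, palette):
    """1-WL colour refinement; `palette` is shared so colours are comparable across graphs."""
    colour = {v: 0 for v in adj}
    for _ in range(rounds):
        sig = {v: (colour[v], tuple(sorted(colour[u] for u in adj[v]))) for v in adj}
        colour = {v: palette.setdefault(sig[v], len(palette) + 1) for v in adj}
    return Counter(colour.values())

def girth(adj):
    """Length of the shortest cycle, via BFS from every node."""
    best = float("inf")
    for s in adj:
        dist, parent, queue = {s: 0}, {s: None}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w], parent[w] = dist[u] + 1, u
                    queue.append(w)
                elif parent[u] != w:
                    best = min(best, dist[u] + dist[w] + 1)
    return best

def is_connected(adj):
    start = next(iter(adj))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(adj)

def count_cut_edges(edges, n):
    """Number of edges whose removal disconnects the graph (brute force)."""
    return sum(not is_connected(adjacency([e for e in edges if e != cut], n))
               for cut in edges)

# Decalin skeleton: a 10-cycle with one chord, i.e. two fused 6-rings.
decalin_edges = cycle_edges(list(range(10))) + [(0, 5)]
# Bicyclopentyl skeleton: two 5-rings joined by a single bridging edge.
bicyclopentyl_edges = cycle_edges([0, 1, 2, 3, 4]) + cycle_edges([5, 6, 7, 8, 9]) + [(0, 5)]

decalin = adjacency(decalin_edges, 10)
bicyclopentyl = adjacency(bicyclopentyl_edges, 10)

palette = {}
same_colours = wl_histogram(decalin, 10, palette) == wl_histogram(bicyclopentyl, 10, palette)

print("identical WL colour histograms:", same_colours)
print("girth:", girth(decalin), "vs", girth(bicyclopentyl))
print("cut-edges:", count_cut_edges(decalin_edges, 10), "vs",
      count_cut_edges(bicyclopentyl_edges, 10))
```

Both graphs are connected, have ten nodes and eleven edges, and receive the same colour histogram after refinement, yet they differ in girth and in the number of cut-edges. Any readout that depends only on these colours therefore maps the two graphs to the same embedding.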

Figures (11)

  • Figure 1: Some ligands cannot be distinguished by GNNs even with additional protein context. Left: Construction for Lemma 3.0: two non-isomorphic graphs that differ in all properties stated in the lemma, but for which LU-GNNs produce identical embeddings. Right: Complex graphs constructed by joining the ligand graphs with the same protein graph remain indistinguishable whenever the ligand graphs cannot be differentiated.
  • Figure 2: Comparison of SimpleSBDD to common approaches. Top: SBDD approaches commonly learn to approximate the data distribution of atom types and 3D coordinates conditioned on the protein pocket. Bottom: SimpleSBDD first generates the unlabelled graph explicitly optimized for estimated binding affinity. Then it predicts atom types using different strategies designed for solving different tasks independently of the protein pocket. Finally, it generates a 3D configuration.
  • Figure 3: Optimal ligand size has two modes. Representative examples showing that the scoring model learns an optimal ligand size that differs across proteins.
  • Figure 4: SimpleSBDD generates diverse drug candidates with stronger predicted binding than reference molecules. We visualize the predictions of our model for two randomly chosen proteins (PDB IDs 3gs6 and 4azf). For each protein, we choose the 5 molecules with the best predicted binding affinities and compare their Vina score, QED, and synthetic accessibility with those of the reference molecule.
  • Figure 5: Impact of different transformations of the ligand on the Vina score.
  • ...and 6 more figures

Theorems & Definitions (8)

  • Lemma 3.0
  • Proposition 3.0
  • Definition K.1: Indistinguishability
  • Definition K.2: Computation trees
  • Lemma K.2
  • proof
  • Proposition K.2
  • proof