From Theory to Therapy: Reframing SBDD Model Evaluation via Practical Metrics

Bowen Gao; Haichuan Tan; Yanwen Huang; Minsi Ren; Xiao Huang; Wei-Ying Ma; Ya-Qin Zhang; Yanyan Lan

TL;DR

This work tackles the gap between theoretically strong structure-based drug design (SBDD) outputs and their real-world utility by shifting evaluation from sole reliance on Vina docking scores to a three-tier framework: binding affinity estimation, similarity to known actives and FDA-approved drugs, and virtual screening ability. It introduces detailed metrics, including the $\Delta$ score, the DrugCLIP score, Active/FDA Similarity, and BEDROC/EF, and constructs a realistic benchmark from real crystal structures (PDBbind) with rigorous attention to pocket diversity and a test set drawn from DUD-E and LIT-PCBA. Comprehensive experiments across five baselines (LiGAN, AR, Pocket2Mol, TargetDiff, MolCRAFT) reveal that high docking scores often do not translate to practical usefulness, highlighting a substantial gap between theory and deployment. The proposed dataset and metrics aim to guide future SBDD models toward outputs that are not only score-competitive but also synthesizable, drug-like, and reusable in virtual screening, thereby accelerating the pathway from theory to therapy.

Abstract

Recent advancements in structure-based drug design (SBDD) have significantly enhanced the efficiency and precision of drug discovery by generating molecules tailored to bind specific protein pockets. Despite these technological strides, their practical application in real-world drug development remains challenging due to the complexities of synthesizing and testing these molecules. The reliability of the Vina docking score, the current standard for assessing binding abilities, is increasingly questioned due to its susceptibility to overfitting. To address these limitations, we propose a comprehensive evaluation framework that includes assessing the similarity of generated molecules to known active compounds, introducing a virtual screening-based metric for practical deployment capabilities, and re-evaluating binding affinity more rigorously. Our experiments reveal that while current SBDD models achieve high Vina scores, they fall short in practical usability metrics, highlighting a significant gap between theoretical predictions and real-world applicability. Our proposed metrics and dataset aim to bridge this gap, enhancing the practical applicability of future SBDD models and aligning them more closely with the needs of pharmaceutical research and development.
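To make the virtual screening-based metric more concrete, the following is a minimal sketch of the enrichment factor (EF) in its commonly used form: the hit rate among the top-ranked fraction of a screened library divided by the hit rate of the whole library. The function name, the `fraction` parameter, and the toy scores are illustrative assumptions; this excerpt does not specify how the paper scores library compounds against generated molecules, so the ranking scores are treated here as given.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at the top `fraction` of a ranked library.

    scores: higher means "more likely active" (assumed convention).
    labels: 1 for a known active, 0 for a decoy.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal in length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")

    # Size of the top-ranked subset (at least one compound).
    n_top = max(1, math.ceil(n_total * fraction))

    # Rank compounds by descending score and count actives in the top subset.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])

    # Hit rate in the top subset relative to the hit rate of the full library.
    return (hits_top / n_top) / (n_actives / n_total)


# Toy example (made-up numbers): 2 actives among 10 compounds, both ranked first.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_ef = enrichment_factor(toy_scores, toy_labels, fraction=0.2)
```

An EF of 1 corresponds to no enrichment over random selection; larger values mean the ranking concentrates actives near the top. In the setting described here, the scores would come from comparing each library compound with the generated molecules used as reference templates.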

Paper Structure (31 sections, 4 equations, 9 figures, 8 tables)

Introduction
Related Work
Methods
Evaluation Metrics that meet practical needs
Binding Affinity Estimation
Similarity to Known Drugs and Actives
Virtual Screening Ability
Test Dataset
Training and Validation Dataset
Experiments
Tested Models
Results of Binding Ability Estimation
Results of Similarity-Based Metrics
Results of Virtual Screening-Based Metrics
Results of LIT-PCBA dataset
...and 16 more sections

Figures (9)

Figure 1: The relationship between Vina Dock Score/QED and number of atoms
Figure 2: Our three-level evaluation metrics include: (a) Binding affinity estimation, which encompasses the Vina docking score, delta score, and DrugCLIP score; (b) Similarity-based metrics that assess the resemblance between generated molecules and known actives, as well as drugs on the FDA-approved list; (c) Virtual screening ability metrics that evaluate the capability of the generated molecules to differentiate between actives and decoys when used as reference templates.
Figure 3: (a) Distribution of Enrichment Factors of different models across different targets in DUD-E. (b) Radar plot that shows the performance of different methods on part of our multifaceted metrics.
Figure 4: Cases that show the importance of using similarity-based metrics to evaluate the effectiveness of generated molecules
Figure 5: Relationship between FLAPP Score Threshold and Dataset Sizes
...and 4 more figures
