WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

Yunchao Liu, Ha Dong, Xin Wang, Rocco Moretti, Yu Wang, Zhaoqian Su, Jiawei Gu, Bobby Bodenheimer, Charles David Weaver, Jens Meiler, Tyler Derr

TL;DR

The paper addresses the lack of robust benchmarking in AI-driven small-molecule drug discovery by introducing WelQrate, a gold-standard framework built on rigorously curated datasets (nine datasets across five target classes), an evaluation protocol, and comprehensive benchmarking. It combines expert-driven data curation (including PAINS filtering and hierarchical primary/confirmatory/counter screens) with standardized data formats, featurization, and 3D conformations to enable fair, realistic virtual screening assessments. Through benchmarking, the authors show how model choice, data quality, featurization, and data splits influence performance, and they highlight the enduring value of domain-informed descriptors in scaffold hopping scenarios. WelQrate aims to improve reproducibility and transferability of AI-driven drug discovery by providing transparent procedures, public data, and actionable guidelines for adoption in real-world settings.

Abstract

While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community's efforts cannot reach their full potential, thereby slowing the progress and transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking, WelQrate. Specifically, our contributions are threefold: WelQrate Dataset Collection - we introduce a meticulously curated collection of 9 datasets spanning 5 therapeutic target classes. Our hierarchical curation pipelines, designed by drug discovery experts, go beyond the primary high-throughput screen by leveraging additional confirmatory and counter screens along with rigorous domain-driven preprocessing, such as Pan-Assay Interference Compounds (PAINS) filtering, to ensure high-quality data; WelQrate Evaluation Framework - we propose a standardized model evaluation framework covering high-quality datasets, featurization, 3D conformation generation, evaluation metrics, and data splits, which provides reliable benchmarking for drug discovery experts conducting real-world virtual screening; Benchmarking - we evaluate model performance through various research questions using the WelQrate dataset collection, exploring the effects of different models, dataset quality, featurization methods, and data splitting strategies on the results. In summary, we recommend adopting our proposed WelQrate as the gold standard in small molecule drug discovery benchmarking. The WelQrate dataset collection, curation code, and experimental scripts are all publicly available at WelQrate.org.
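The hierarchical curation idea in the abstract (primary screen, then confirmatory screen, then exclusion of counter-screen actives) can be illustrated with plain set operations. This is a minimal sketch, not the authors' pipeline: the function name `curate_actives` and the toy compound IDs are hypothetical, and the real curation also includes steps such as PAINS filtering that are not modeled here.

```python
def curate_actives(primary_actives, confirmed_actives, counter_actives):
    """Keep compounds active in both the primary and confirmatory screens,
    then drop any that are also active in the counter screen."""
    # Actives from the primary screen that the confirmatory screen verifies.
    confirmed = set(primary_actives) & set(confirmed_actives)
    # Counter-screen actives are treated as likely artifacts and excluded.
    return confirmed - set(counter_actives)


# Toy compound IDs, invented for illustration only.
primary = {"cpd1", "cpd2", "cpd3", "cpd4"}
confirmatory = {"cpd2", "cpd3", "cpd4", "cpd9"}
counter = {"cpd4"}

final_actives = curate_actives(primary, confirmatory, counter)
print(sorted(final_actives))  # ['cpd2', 'cpd3']
```

In this toy run, `cpd1` is dropped because it is not confirmed, `cpd9` because it was never a primary active, and `cpd4` because it is active in the counter screen.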

Paper Structure

This paper contains 35 sections, 9 equations, 19 figures, 14 tables.

Figures (19)

  • Figure 1: An overview of the data curation pipeline.
  • Figure 2: An example of hierarchical curation for AID 1798. Initially, 63,676 compounds go through a primary screen (AID 626). The 1,665 actives found then go through a confirmatory screen (AID 1488) to verify their activity, and those showing activity in a counter screen (AID 1741) are excluded from the final active set.
  • Figure 3: Illustration of the adapted cross-validation.
  • Figure 4: Categorical performance comparison among different models (RQ1) trained on the WelQrate dataset collection and on the control dataset, respectively (RQ2). Individual model performances are shown in Fig. \ref{fig-rq4}. Values are averages of performance across datasets. Error bars denote standard error across multiple experimental runs and AIDs. For simplicity, WelQrate refers to the WelQrate dataset collection in the legend.
  • Figure 5: Comparison of model performance using one-hot encoding and pre-defined features on the WelQrate dataset collection (RQ3). Error bars denote standard error across multiple experimental runs.
  • ...and 14 more figures