How Evaluation Choices Distort the Outcome of Generative Drug Discovery

Rıza Özçelik, Francesca Grisoni

TL;DR

The number of designs can distort evaluation outcomes related to distributional similarity and diversity. Using larger design libraries than are typically adopted helps avoid this pitfall, and the authors develop efficient algorithms to enable such large-scale studies.

Abstract

"How to evaluate the de novo designs proposed by a generative model?" Despite the transformative potential of generative deep learning in drug discovery, this seemingly simple question has no clear answer. The absence of standardized guidelines challenges both the benchmarking of generative approaches and the selection of molecules for prospective studies. In this work, we take a fresh - critical and constructive - perspective on de novo design evaluation. By training chemical language models, we analyze approximately 1 billion molecule designs and discover principles consistent across different neural networks and datasets. We uncover a key confounder: the size of the generated molecular library significantly impacts evaluation outcomes, often leading to misleading model comparisons. We find that increasing the number of designs remedies this issue, and we propose new, compute-efficient metrics that can be computed at large scale. We also identify critical pitfalls in commonly used metrics - such as uniqueness and distributional similarity - that can distort assessments of generative performance. To address these issues, we propose new and refined strategies for reliable model comparison and design evaluation. Furthermore, when examining molecule selection and sampling strategies, our findings reveal the constraints on diversifying the generated libraries and draw new parallels and distinctions between deep learning and drug discovery. We anticipate that our findings will help reshape evaluation pipelines in generative drug discovery, paving the way for more reliable and reproducible generative modeling approaches.

Paper Structure (17 sections, 2 equations, 17 figures, 2 tables)

Figures (17)

  • Figure 1: Number of de novo designs as a key confounder - similarity to existing molecules. Fréchet ChemNet Distance (FCD, a) and Fréchet Descriptor Distance (FDD, b) are measured at increasing library sizes. Solid lines denote the median distance between design libraries and respective fine-tuning sets, across five repetitions ($n=5$), and shaded regions display the first and third quartiles. Dashed lines display the median distance of held-out actives ($n=128$) and inactives ($100 \leq n \leq 1280$) to the training sets.
  • Figure 2: Number of de novo designs as a key confounder - internal diversity. Three internal diversity metrics are measured at increasing library sizes. (a) Uniqueness, that is, the fraction of distinct designs among the chemically valid ones. (b) Number of clusters, computed via sphere exclusion clustering, which denotes the number of structurally distant molecules in the library. (c) Number of substructures, i.e., the number of unique Morgan keys (Rogers and Hahn, 2010). For all panels, lines display the median score measured across five fine-tuning repetitions ($n=5$) and the shaded regions show the first and third quartiles.
  • Figure 3: Navigating large design libraries. We bin the designs per protein target into ten increasing likelihood bins and compute metrics for the designs in each decile. (a) The fraction of valid designs (validity) and the fraction of unique designs not in the respective training set (novelty) are computed. The lines represent the median across five fine-tuning campaigns, and the shaded regions mark the first and third quartiles. (b) Structural similarity of the designs to the training set per decile is computed via Tanimoto similarity over extended connectivity fingerprints (Rogers and Hahn, 2010). Similarities are pooled across five repetitions and visualized as a box plot. (c) The diversity in each decile is computed via the number of substructures. Bar heights denote the median across runs, while the error bars mark the first and third quartiles.
  • Figure 4: Likelihoods and model hallucinations. Designs of an LSTM trained on a DRD3 dataset are binned into increasing likelihood deciles. (a) The most frequent generic Bemis-Murcko scaffold is visualized for deciles 1, 4, 7, and 10, as well as for the training set. The number below each scaffold denotes its frequency in the library. (b) Highly frequent (sampled more than ten times) and least likely designs. The maximum structural similarity to the fine-tuning sets (Tanimoto similarity on extended connectivity fingerprints) is reported below each design.
  • Figure 5: Benchmarking molecule sampling strategies. The fine-tuned LSTM models across datasets are sampled using temperature, top-$k$, and top-$p$ sampling, at different temperatures, $k$, and $p$ values. 1,000,000 designs are produced per dataset split and sampling parameter combination. (a) Syntactic quality of the designs, measured as the fraction of valid compounds (validity) and the fraction of unique and novel compounds (novelty). The lines denote the median across five repetitions and the borders of the shaded areas display the first and third quartiles. (b) Maximum structural similarity of each design to the respective training set is computed (as Tanimoto similarity on extended connectivity fingerprints; Rogers and Hahn, 2010) and the values across dataset splits are visualized as boxplots ($n \approx 5{,}000{,}000$). (c) Diversity of the designs is measured via the number of substructures, i.e., the number of unique Morgan keys identified (Rogers and Hahn, 2010). The lines denote the median of five runs and the shaded regions denote the interquartile ranges.
  • ...and 12 more figures
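
The caption of Figure 2a defines uniqueness as the fraction of distinct designs among the chemically valid ones. The following is a minimal sketch of why such a metric can depend on library size. It assumes a hypothetical toy "model" that samples uniformly from a fixed pool of 1,000 placeholder strings; the names `uniqueness`, `sample_library`, and `pool` are illustrative and are not taken from the paper or its code.

```python
import random


def uniqueness(designs):
    """Fraction of distinct designs in a library (validity filtering assumed done)."""
    if not designs:
        raise ValueError("the design library is empty")
    return len(set(designs)) / len(designs)


def sample_library(pool, n, seed=0):
    """Hypothetical stand-in for a generative model: draw n designs with replacement."""
    rng = random.Random(seed)
    return rng.choices(pool, k=n)


# Toy pool of 1,000 placeholder "molecules" (an assumption for illustration only).
pool = [f"mol_{i}" for i in range(1000)]

small = sample_library(pool, 100)
large = sample_library(pool, 100_000)

u_small = uniqueness(small)
u_large = uniqueness(large)
print(f"n=100     uniqueness={u_small:.3f}")
print(f"n=100000  uniqueness={u_large:.3f}")
```

Because the toy pool is finite, the same sampler cannot exceed 1,000 distinct designs, so its uniqueness at 100,000 designs is at most 0.01 regardless of the random draw. Comparing two models at different library sizes would therefore mix model quality with the effect of the library size itself.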
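
The captions of Figures 2b and 3b refer to sphere exclusion clustering and to Tanimoto similarity on fingerprints. A minimal sketch of both ideas follows, assuming fingerprints are represented as Python sets of on-bit indices with made-up toy values. The greedy picking order and the `min_distance=0.65` threshold are assumptions for illustration; the paper's actual settings are not stated in this section, and a real pipeline would use a cheminformatics toolkit to compute the fingerprints.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def max_similarity_to_training(design, training_set):
    """Highest Tanimoto similarity between one design and any training molecule."""
    return max(tanimoto(design, t) for t in training_set)


def count_sphere_exclusion_clusters(fingerprints, min_distance=0.65):
    """Greedy sphere exclusion: a molecule becomes a new centroid only if its
    distance (1 - Tanimoto) to every existing centroid is at least min_distance."""
    centroids = []
    for fp in fingerprints:
        if all(1.0 - tanimoto(fp, c) >= min_distance for c in centroids):
            centroids.append(fp)
    return len(centroids)


# Toy fingerprints (hypothetical on-bit indices, not real molecules).
training = [{1, 2, 3, 4}, {10, 11, 12}]
designs = [{1, 2, 3, 5}, {1, 2, 3, 4}, {20, 21, 22}, {10, 11, 13}]

for d in designs:
    print(sorted(d), round(max_similarity_to_training(d, training), 2))
print("clusters:", count_sphere_exclusion_clusters(designs))
```

In this toy library the second design lies within the exclusion sphere of the first, so the four designs yield three centroids. The cluster count grows only when a design is structurally distant from everything already selected, which is why it behaves differently from uniqueness.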
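
The caption of Figure 5 names three sampling strategies: temperature, top-$k$, and top-$p$ sampling. The sketch below shows how each one reshapes a next-token probability distribution before a token is drawn. It is a plain-Python illustration under stated assumptions (the function names and the toy logits and probabilities are hypothetical), not the authors' implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperatures sharpen the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every token not in `keep` and rescale the rest to sum to one."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


# Toy next-token distribution over four tokens (illustrative values).
probs = [0.5, 0.3, 0.15, 0.05]
top_k_probs = top_k_filter(probs, k=2)
top_p_probs = top_p_filter(probs, p=0.9)
sharp = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
flat = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
print(top_k_probs, top_p_probs, sharp, flat)
```

All three strategies trade diversity for likelihood: top-$k$ and top-$p$ truncate the tail of the distribution, while temperature rescales it without removing any token.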