GenMol: A Drug Discovery Generalist with Discrete Diffusion

Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Yuxing Peng, Saee Paliwal, Weili Nie, Arash Vahdat

TL;DR

GenMol is a unified framework for drug discovery that casts molecular design as masked discrete diffusion over SAFE fragment sequences. Its two key innovations are fragment remasking, which regenerates masked fragments to explore chemical space beyond a fixed fragment vocabulary, and molecular context guidance (MCG), a guidance method tailored to masked discrete diffusion. Without task-specific fine-tuning, a single model outperforms the previous GPT-based model in de novo and fragment-constrained generation and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. The result is one efficient model that covers a wide range of molecular design tasks.

Abstract

Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on the specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design. Our code is available at https://github.com/NVIDIA-Digital-Bio/genmol.
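
The non-autoregressive parallel decoding described in the abstract can be illustrated with a toy loop. This is a minimal sketch, not the GenMol implementation: `toy_logits` is a hypothetical stand-in for the trained bidirectional denoiser, the token vocabulary is made up, and committing the most confident proposals at each step is an assumption about how positions are chosen. The parameters `n_per_step` and `tau` play the roles of $N$ (tokens unmasked per step) and $\tau$ (softmax temperature); the randomness $r$ is omitted.

```python
import math
import random

MASK = "[MASK]"

def toy_logits(tokens, position, vocab):
    """Hypothetical stand-in for the denoiser: uniform logits over the vocabulary."""
    return [0.0 for _ in vocab]

def sample_with_temperature(logits, tau, rng):
    """Sample an index from softmax(logits / tau) and return it with its probability."""
    scaled = [l / tau for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs[idx]

def parallel_unmask(length, vocab, n_per_step=2, tau=1.0,
                    logits_fn=toy_logits, seed=0):
    """Start fully masked and unmask up to n_per_step positions per step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Propose a token for every masked position, conditioning on the
        # whole sequence (left and right context) rather than a prefix.
        proposals = {}
        for i in masked:
            idx, p = sample_with_temperature(logits_fn(tokens, i, vocab), tau, rng)
            proposals[i] = (vocab[idx], p)
        # Assumption: commit the most confident proposals; the rest stay masked.
        chosen = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:n_per_step]
        for i in chosen:
            tokens[i] = proposals[i][0]
        steps += 1
    return tokens, steps

tokens, steps = parallel_unmask(7, ["C", "N", "O", "(", ")"], n_per_step=2)
```

With `n_per_step=2`, a sequence of 7 tokens is completed in 4 steps instead of the 7 steps a token-by-token decoder would need, which is the source of the sampling-efficiency gain the abstract refers to.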

Paper Structure (52 sections, 13 equations, 9 figures, 17 tables, 1 algorithm)

Figures (9)

  • Figure 1: Results on drug discovery tasks. The values are quality, average quality, sum AUC top-10, and success rate for de novo generation, fragment-constrained generation, hit generation, and lead optimization, respectively. The "best baseline" refers to multiple best-performing task-specific models among prior works.
  • Figure 1: De novo molecule generation results. The results are the means and the standard deviations of 3 runs. $N$, $\tau$, and $r$ are the number of tokens to unmask at each time step, the softmax temperature, and the randomness, respectively. The best results are highlighted in bold.
  • Figure 2: (a) GenMol architecture. GenMol adopts the BERT architecture and is trained with the NELBO loss of masked discrete diffusion. (b) Generation process of GenMol. Under masked discrete diffusion, GenMol completes a molecule by simulating backward in time and predicting masked tokens at each time step $t$ until all tokens are unmasked. (c) Illustration of various drug discovery tasks that can be performed by GenMol. GenMol is endowed with the ability to easily perform (c1) de novo generation, (c2-c5) fragment-constrained generation, and (c6) fragment remasking that can be applied to goal-directed hit generation and lead optimization.
  • Figure 3: (a) Goal-directed hit generation and lead optimization process with GenMol. An initial fragment vocabulary is constructed by decomposing an existing molecular dataset (hit generation) or a seed molecule (lead optimization). Two fragments are randomly sampled from the vocabulary and attached, and GenMol performs fragment remasking. The fragment vocabulary is updated with the generated molecules for the next iteration. (b) Illustration of the molecular optimization trajectory with fragment remasking. With fragment remasking, GenMol can explore beyond the initial fragment vocabulary to find chemical optima.
  • Figure 4: The quality-diversity trade-off in de novo generation with different values of $(\tau,r)$.
  • ...and 4 more figures
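
The optimization loop described in the Figure 3 caption can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' code: molecules are represented as plain lists of fragment strings, and `regenerate` and `score` are hypothetical callables standing in for GenMol's fragment regeneration and for the property oracle. How the vocabulary is updated (here, with the fragments of the top-scoring molecules) is also an assumption.

```python
import random

def fragment_remasking_loop(vocabulary, regenerate, score,
                            iterations=20, keep_top=10, seed=0):
    """Toy goal-directed loop: sample two fragments, remask one, update the vocabulary."""
    rng = random.Random(seed)
    vocab = list(vocabulary)
    best = []  # (score, molecule) pairs, highest score first
    for _ in range(iterations):
        # Attach two randomly sampled fragments to form a starting molecule.
        molecule = rng.sample(vocab, 2)
        # Mask one fragment and regenerate it conditioned on the remaining one.
        masked_index = rng.randrange(len(molecule))
        context = molecule[:masked_index] + molecule[masked_index + 1:]
        molecule[masked_index] = regenerate(context, rng)
        # Track the best molecules seen so far.
        best.append((score(molecule), tuple(molecule)))
        best = sorted(set(best), reverse=True)[:keep_top]
        # Update the vocabulary with fragments from the top molecules, so
        # later iterations can start from fragments not in the initial set.
        for _, mol in best:
            for fragment in mol:
                if fragment not in vocab:
                    vocab.append(fragment)
    return best, vocab

# Hypothetical stand-ins, for illustration only.
def toy_regenerate(context, rng):
    return rng.choice(["CCCC", "c1ccccc1", "C(=O)O"])

def toy_score(molecule):
    return sum(len(fragment) for fragment in molecule)

initial = ["CC", "CO", "CN"]
best, vocab = fragment_remasking_loop(initial, toy_regenerate, toy_score)
```

Because regenerated fragments are fed back into the vocabulary, the search is not confined to recombining the initial fragments, which is the behavior panel (b) of Figure 3 illustrates.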