Refine Drugs, Don't Complete Them: Uniform-Source Discrete Flows for Fragment-Based Drug Discovery

Benno Kaech, Luis Wyss, Karsten Borgwardt, Gianvito Grasso

TL;DR

This work introduces InVirtuoGen, a uniform-source discrete-flow model for fragmented SMILES designed to guide fragment-based drug discovery from hit generation to lead optimization. By refining all sequence positions at every denoising step and decoupling sampling steps from sequence length, the method achieves a superior quality-diversity frontier in de novo generation and competitive results on fragment-constrained tasks. A hybrid optimization stack combining a genetic algorithm with Proximal Property Optimization-inspired fine-tuning enhances target-property optimization on the PMO benchmark and yields improved docking scores in lead optimization. The framework supports end-to-end drug-discovery workflows and is accompanied by open-source pretrained checkpoints and code to enable reproducibility and broader adoption.

Abstract

We introduce InVirtuoGen, a discrete flow generative model over fragmented SMILES for de novo generation, fragment-constrained generation, and target-property and lead optimization of small molecules. The model learns to transform a uniform source over all possible tokens into the data distribution. Unlike masked models, its training loss accounts for predictions on all sequence positions at every denoising step, shifting the generation paradigm from completion to refinement and decoupling the number of sampling steps from the sequence length. For de novo generation, InVirtuoGen achieves a stronger quality-diversity Pareto frontier than prior fragment-based models and competitive performance on fragment-constrained tasks. For property and lead optimization, we propose a hybrid scheme that combines a genetic algorithm with a Proximal Property Optimization fine-tuning strategy adapted to discrete flows. Our approach sets a new state of the art on the Practical Molecular Optimization benchmark, measured by top-10 AUC across tasks, and yields higher docking scores in lead optimization than previous baselines. InVirtuoGen thus establishes a versatile generative foundation for drug discovery, from early hit finding to multi-objective lead optimization. We further contribute to open science by releasing pretrained checkpoints and code, making our results fully reproducible (https://github.com/invirtuolabs/InVirtuoGen_results).
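
The contrast between completion and refinement drawn in the abstract can be illustrated with a toy corruption step. The sketch below is not taken from the InVirtuoGen code. It assumes a linear mixture path in which each token is kept with probability t, and the names `MASK`, `corrupt_uniform`, `corrupt_masked`, and the two `scored_positions_*` helpers are hypothetical. It only shows which positions a loss would score under each scheme.

```python
import random

MASK = -1  # hypothetical mask id, chosen only for this illustration


def corrupt_uniform(tokens, t, vocab_size, rng):
    """Keep each token with probability t; otherwise draw a uniform random token."""
    return [tok if rng.random() < t else rng.randrange(vocab_size) for tok in tokens]


def corrupt_masked(tokens, t, rng):
    """Keep each token with probability t; otherwise replace it with MASK."""
    return [tok if rng.random() < t else MASK for tok in tokens]


def scored_positions_uniform(noisy):
    # A noisy token is indistinguishable from a clean one, so all positions are scored.
    return list(range(len(noisy)))


def scored_positions_masked(noisy):
    # Only masked positions carry a prediction target.
    return [i for i, tok in enumerate(noisy) if tok == MASK]


rng = random.Random(0)
clean = [3, 1, 4, 1, 5, 9, 2, 6]  # toy token ids, not real fragment tokens
noisy_u = corrupt_uniform(clean, t=0.5, vocab_size=10, rng=rng)
noisy_m = corrupt_masked(clean, t=0.5, rng=rng)
print(len(scored_positions_uniform(noisy_u)), len(scored_positions_masked(noisy_m)))
```

With a uniform source the model cannot tell which tokens are already correct, so every position is a candidate for revision at every step. With masking, unmasked tokens are fixed and only the remaining gaps are filled.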

Paper Structure

This paper contains 38 sections, 8 equations, 15 figures, 5 tables, 2 algorithms.

Figures (15)

  • Figure 1: Comparison of generation paradigms: (a) autoregressive models generate tokens sequentially (here simplified by omitting BOS/EOS tokens), (b) masked diffusion models iteratively reveal masked positions, and (c) discrete flows refine all positions starting from a uniform source distribution, where shading indicates the transition from random tokens to data.
  • Figure 2: Comparison between SMILES, SAFE, and our notation for the same molecule. Our notation preserves fragment integrity while providing explicit attachment point numbering that facilitates bidirectional modeling of molecular structure.
  • Figure 3: Quality-diversity trade-off for GenMol, SAFE-GPT (single point, as no quality-diversity scan data is available), and our model at different simulation time granularities ($h \in \{0.1, 0.01, 0.001\}$). Curves correspond to varying sampling noise $(T,r)$, where $T$ is the softmax temperature and $r$ is the Gumbel noise scale.
  • Figure 4: Sequence length distribution of ZINC250K. The maximum observed length is 84, which implies that a masked model requires at least 84 sampling steps, putting it close to our step size $h=0.01$.
  • Figure 5: Non-curated samples for de novo generation.
  • ...and 10 more figures
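
The step-count comparison in the Figure 4 caption can be checked with a few lines of arithmetic. This is a rough sketch under two assumptions not stated in the captions: that the flow is integrated from t = 0 to t = 1 with a fixed step size h, and that a masked sampler reveals a fixed number of tokens per step (one by default). The helper names `flow_steps` and `masked_steps` are hypothetical.

```python
MAX_LEN = 84  # maximum ZINC250K sequence length quoted in the Figure 4 caption


def flow_steps(h):
    """Steps needed to move t from 0 to 1 with a fixed step size h."""
    return round(1.0 / h)


def masked_steps(seq_len, tokens_per_step=1):
    """Steps needed to reveal seq_len tokens (assumed: a fixed number per step)."""
    return -(-seq_len // tokens_per_step)  # ceiling division


for h in (0.1, 0.01, 0.001):
    print(f"h={h}: flow steps={flow_steps(h)}, masked steps={masked_steps(MAX_LEN)}")
```

Under these assumptions, h = 0.01 corresponds to 100 flow steps, which is close to the 84 steps a one-token-per-step masked sampler would need for the longest sequence.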