Refine Drugs, Don't Complete Them: Uniform-Source Discrete Flows for Fragment-Based Drug Discovery
Benno Kaech, Luis Wyss, Karsten Borgwardt, Gianvito Grasso
TL;DR
This work introduces InVirtuoGen, a uniform-source discrete-flow model over fragmented SMILES that supports fragment-based drug discovery from hit generation to lead optimization. The model refines all sequence positions at every denoising step, which decouples the number of sampling steps from the sequence length. In de novo generation it achieves a stronger quality-diversity Pareto frontier than prior fragment-based models, and it is competitive on fragment-constrained tasks. A hybrid optimization scheme combining a genetic algorithm with Proximal Property Optimization-inspired fine-tuning sets a new state of the art on the Practical Molecular Optimization (PMO) benchmark, measured by top-10 AUC across tasks, and yields higher docking scores in lead optimization than previous baselines. Pretrained checkpoints and code are released as open source to support reproducibility and broader adoption.
Abstract
We introduce InVirtuoGen, a discrete flow generative model over fragmented SMILES for de novo and fragment-constrained generation and for target-property and lead optimization of small molecules. The model learns to transform a uniform source over all possible tokens into the data distribution. Unlike masked models, its training loss accounts for predictions on all sequence positions at every denoising step, shifting the generation paradigm from completion to refinement and decoupling the number of sampling steps from the sequence length. For de novo generation, InVirtuoGen achieves a stronger quality-diversity Pareto frontier than prior fragment-based models and competitive performance on fragment-constrained tasks. For property and lead optimization, we propose a hybrid scheme that combines a genetic algorithm with a Proximal Property Optimization fine-tuning strategy adapted to discrete flows. Our approach sets a new state of the art on the Practical Molecular Optimization benchmark, measured by top-10 AUC across tasks, and yields higher docking scores in lead optimization than previous baselines. InVirtuoGen thus establishes a versatile generative foundation for drug discovery, from early hit finding to multi-objective lead optimization. We further contribute to open science by releasing pretrained checkpoints and code, making our results fully reproducible (https://github.com/invirtuolabs/InVirtuoGen_results).
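To make the "refinement, not completion" idea concrete, the following is a minimal sketch of a uniform-source discrete-flow sampler. It is not the InVirtuoGen implementation: the function names, the integer token vocabulary, and the `toy_denoiser` (a hypothetical stand-in for a trained network that simply points at a fixed target sequence) are all assumptions made for illustration. The update rule is a common Euler step for discrete flows with a linear schedule, in which each position jumps to a sample from the predicted clean-token distribution with probability dt / (1 - t).

```python
import numpy as np


def toy_denoiser(x_t, t, target, vocab_size):
    """Hypothetical stand-in for a trained model (assumption, not the paper's network).

    Returns p(x_1 | x_t) for EVERY position, here a one-hot on a fixed target.
    """
    probs = np.zeros((len(x_t), vocab_size))
    probs[np.arange(len(x_t)), target] = 1.0
    return probs


def sample_uniform_source(denoiser, seq_len, vocab_size, n_steps, rng):
    """Euler sampler for a uniform-source discrete flow (illustrative sketch)."""
    # Source sample: every position is an independent uniform draw over the vocabulary.
    x = rng.integers(0, vocab_size, size=seq_len)
    for i in range(n_steps):
        t = i / n_steps
        probs = denoiser(x, t)  # shape (seq_len, vocab_size), all positions at once
        # Draw one candidate clean token per position from the predicted distribution.
        proposal = np.array([rng.choice(vocab_size, p=p) for p in probs])
        # With a linear schedule, dt / (1 - t) simplifies to 1 / (n_steps - i).
        jump = rng.random(seq_len) < 1.0 / (n_steps - i)
        # Any position may be rewritten at any step; none is frozen once set.
        x = np.where(jump, proposal, x)
    return x


rng = np.random.default_rng(0)
target = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # 8 toy token ids
out = sample_uniform_source(
    lambda x, t: toy_denoiser(x, t, target, vocab_size=10),
    seq_len=8,
    vocab_size=10,
    n_steps=4,  # fewer steps than tokens: step count is independent of length
    rng=rng,
)
```

In this toy setup the sequence has 8 tokens but only 4 sampling steps are used, which illustrates the decoupling of step count from sequence length; a masked model that unmasks a fixed number of tokens per step would instead tie the two together.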
