A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules
Manuel Ruiz-Botella, Marta Sales-Pardo, Roger Guimerà
TL;DR
CoCoGraph is a collaborative, constrained discrete diffusion model for generating chemically valid, diverse molecules. By embedding valence constraints directly into a discrete double-edge-swap diffusion process and using a collaborative time model to guide denoising, the approach achieves 100% chemical validity with up to an order of magnitude fewer parameters than prior models. An evaluation across 36 chemical properties shows that the generated molecules closely match the property distributions of real molecules. A database of 8.2 million generated molecules demonstrates high novelty, and a Turing-like test with organic chemistry experts assesses their plausibility. The method enables scalable, chemistry-grounded molecular generation with potential impact on drug discovery, materials design, and catalysis, supported by open data and code.
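To illustrate why a double edge swap can keep valences fixed, the following minimal sketch applies the generic swap (a,b),(c,d) -> (a,d),(c,b) to a molecular graph stored as a list of edges, with a repeated edge standing in for a higher bond order. It is not the authors' implementation: the function names `degrees` and `double_edge_swap`, the edge-list representation, and the rejection of self-loops are assumptions made for illustration, and the sketch ignores connectivity, bond-order limits, and the learned denoising model.

```python
import random
from collections import Counter


def degrees(edges):
    """Count how many edge endpoints touch each node (a proxy for used valence)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def double_edge_swap(edges, rng, max_tries=100):
    """Rewire two random edges (a,b),(c,d) into (a,d),(c,b).

    Every node keeps the same number of incident edges, so the per-atom
    valence count is unchanged. Returns a new edge list; if no swap
    without self-loops is found within max_tries, the copy is returned
    unmodified.
    """
    edges = list(edges)
    for _ in range(max_tries):
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # try the other orientation of the second edge
        if a == d or c == b:
            continue  # the swap would create a self-loop; pick again
        edges[i] = (a, d)
        edges[j] = (c, b)
        return edges
    return edges


# Hypothetical example: a six-membered ring with one double bond (0, 1).
rng = random.Random(0)
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 1)]
swapped = double_edge_swap(ring, rng)
assert degrees(swapped) == degrees(ring)
```

Because each swap removes one edge endpoint from every affected node and adds one back, any sequence of such swaps leaves the degree of each atom untouched, which is the property that lets a swap-based diffusion respect valence by construction.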
Abstract
Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules whose property distributions match those of real molecules more closely than current models do. Leveraging the model's efficiency, we created a database of 8.2 million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, as well as the potential biases and limitations of CoCoGraph.
