A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules

Manuel Ruiz-Botella, Marta Sales-Pardo, Roger Guimerà

TL;DR

CoCoGraph introduces a collaborative constrained discrete diffusion model for generating chemically valid, diverse molecules. By embedding valence constraints directly into a discrete double edge swap diffusion and employing a collaborative time model to guide denoising, the approach achieves 100% chemical validity with up to an order of magnitude fewer parameters than prior models. Comprehensive evaluation across 36 chemical properties shows that the property distributions of generated molecules closely match those of real molecules. A database of 8.2 million generated molecules demonstrates high novelty, and a Turing-like test with organic chemistry experts assesses their plausibility. The method enables scalable, chemistry-grounded molecular generation with potential impact on drug discovery, materials design, and catalysis, supported by open data and code.

Abstract

Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules with distributions more closely matching real molecules than current models. Leveraging the model's efficiency, we created a database of 8.2 million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, as well as potential biases and limitations of CoCoGraph.
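
To make the idea of a valence-preserving noise step concrete, here is a minimal sketch of a double edge swap on a molecular graph. It is an illustration under stated assumptions, not the authors' implementation: the bond dictionary representation, the function names `valences` and `double_edge_swap`, the `max_order` cap of 3, and the uniform random choice of bonds are all hypothetical. The sketch also does not check that the molecule stays connected.

```python
import random
from collections import Counter


def valences(bonds):
    """Sum of bond orders at each atom (hypothetical helper)."""
    deg = Counter()
    for (u, v), order in bonds.items():
        deg[u] += order
        deg[v] += order
    return deg


def double_edge_swap(bonds, rng, max_order=3, max_tries=100):
    """Return a copy of `bonds` after one random swap a-b, c-d -> a-c, b-d.

    `bonds` maps sorted atom-index pairs (u, v) with u < v to a bond order.
    Each swap lowers two bond orders by one and raises two others by one,
    so every atom keeps its total valence. Assumed rules: no self-bonds,
    and no bond order above `max_order`.
    """
    for _ in range(max_tries):
        (a, b), (c, d) = rng.sample(sorted(bonds), 2)
        if rng.random() < 0.5:
            c, d = d, c  # choose between the two possible rewirings
        if len({a, b, c, d}) < 4:
            continue  # a shared atom would create a self-bond or a no-op
        new = Counter(bonds)
        for pair, delta in (((a, b), -1), ((c, d), -1),
                            ((a, c), +1), ((b, d), +1)):
            new[tuple(sorted(pair))] += delta
        if max(new.values()) > max_order:
            continue  # reject swaps that exceed the assumed bond-order cap
        return {pair: order for pair, order in new.items() if order > 0}
    return dict(bonds)  # no valid swap found; return the graph unchanged


# Toy chain of six atoms with one double bond between atoms 1 and 2.
toy = {(0, 1): 1, (1, 2): 2, (2, 3): 1, (3, 4): 1, (4, 5): 1}
noisy = double_edge_swap(toy, random.Random(0))
print(valences(noisy) == valences(toy))
```

Because the swap only moves bond order between pairs of atoms, any sequence of such steps keeps each atom's valence fixed. This is the sense in which a constraint can be built into the diffusion process rather than checked after generation.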

Paper Structure

This paper contains 15 sections, 6 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Constrained collaborative graph diffusion model, CoCoGraph. a, Constrained diffusion process. We introduce noise in the molecular graph by swapping two chemical bonds at each step. We then train diffusion and time models to revert this process. b, Diffusion model. At each step, it receives molecular features and the timestep as input and assigns a score to every possible edge swap. c, Time model. It receives molecular features and estimates the timestep of the current molecular graph. d, Sampling. We use the trained diffusion and time models in collaboration to generate a denoising trajectory starting from a random molecular graph with a defined molecular formula. We then select the molecule with the smallest predicted time as the generated molecule.
  • Figure 2: Performance comparison on GuacaMol benchmark properties. a-f, Distributions of six molecular properties: a, molecular weight; b, molecular logP; c, internal similarity; d, Bertz complexity index; e, number of aromatic rings; and f, number of H-bond donors. For each property, the distribution of values calculated for molecules generated by CoCoGraph (black line) is compared to that of the original molecules (green distribution), and to those of molecules generated by JTVAE (purple dashed line) and DiGress (orange dashed line). Jensen-Shannon (JS) distance values between each model and the original distribution are shown. g, Summary comparison based on the log2 ratio of JS distances between CoCoGraph and comparator models for the properties in (a-f). Positive values indicate that CoCoGraph outperforms the comparator model, and negative values the opposite.
  • Figure 3: Detailed performance comparison on a subset of 36 chemical properties. a-j, Distributions of ten molecular properties: a, heavy atom count; b, number of valence electrons; c, NOCount; d, Balaban’s J Index; e, number of H acceptors; f, ring count; g, topological polar surface area (TPSA); h, quantitative estimate of drug-likeness (QED); i, maximum absolute partial charge; and j, NHOHCount. For each property, the distribution for molecules generated by the CoCoGraph FPS model (black line) is compared to that of the original molecules (green distribution) and to those of molecules generated by JTVAE (purple line) and DiGress (orange line). k, log2 ratio of JS distances between CoCoGraph FPS and the other models, where a positive value indicates that CoCoGraph FPS outperforms the comparator model.
  • Figure 4: Performance in the Turing-like test. We assess the performance of participants in the Turing-like test by computing their accuracy at correctly identifying the original, non-generated molecule over all attempts. Error bars represent the standard error of the mean calculated via bootstrapping. a, Overall accuracy of participants in the Turing-like test. b, Accuracy by level of education in organic chemistry. c, Accuracy by molecular size in terms of the number of atoms. d, Accuracy by ring type. e, Accuracy by conformational flexibility of the molecules. f, Accuracy by bond type. g, Accuracy by functional group type.
  • Figure 5: Architecture of CoCoGraph components. a, The diffusion model processes the molecular graph through a sequence of EnhancedGINE layers. The embeddings of pairs of nodes are then concatenated with edge properties and processed through two feedforward modules to predict the probability of bond formation and bond breakage for each possible double edge swap. b, The time model estimates the diffusion timestep $t$ of the current molecular graph using processed node embeddings obtained after applying the EnhancedGINE module to the features of the molecular graph. c, The message passing component of both models, the EnhancedGINE module. d, The prediction component of the diffusion model.
  • ...and 5 more figures
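
The summary panels in Figures 2 and 3 compare models through a log2 ratio of Jensen-Shannon distances. The sketch below shows one way such a score could be computed from binned property distributions. Several details are assumptions rather than facts from the paper: the base-2 logarithm inside the divergence, the ratio orientation (comparator over CoCoGraph, chosen so that positive values favour CoCoGraph as the captions state), and the function names `js_distance` and `log2_js_ratio`. The example numbers are made up for illustration only.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.

    `p` and `q` are equal-length sequences of probabilities that each sum
    to 1. With base-2 logarithms (an assumption) the result lies in [0, 1].
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins of `a`.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against rounding below 0


def log2_js_ratio(js_comparator, js_model):
    """Positive when `js_model` is smaller, i.e. closer to the reference."""
    return math.log2(js_comparator / js_model)


# Made-up binned distributions of one property (illustrative values only).
reference = [0.10, 0.40, 0.40, 0.10]
model = [0.12, 0.38, 0.40, 0.10]
comparator = [0.30, 0.30, 0.20, 0.20]

score = log2_js_ratio(js_distance(comparator, reference),
                      js_distance(model, reference))
print(score > 0)
```

A score of 1 under this convention would mean the comparator's JS distance to the reference is twice that of the model being evaluated.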