SynCoGen: Synthesizable 3D Molecule Generation via Joint Reaction and Coordinate Modeling

Andrei Rekesh, Miruna Cretu, Dmytro Shevchuk, Vignesh Ram Somnath, Pietro Liò, Robert A. Batey, Mike Tyers, Michał Koziarski, Cheng-Hao Liu

TL;DR

SynCoGen addresses a key bottleneck in generative molecular design: producing 3D molecules that can actually be synthesized. It couples discrete masked graph diffusion over building blocks and reactions with continuous flow matching over atomic coordinates, and is trained on SynSpace, a dataset of synthesis-aware graphs and conformers. The model achieves state-of-the-art results in unconditional 3D molecule generation, supports fragment linking and pharmacophore-conditioned design in zero-shot settings, and provides an explicit synthetic route for each generated structure. These capabilities target practical, synthesis-aware de novo molecular design, with direct utility in lead optimization and analog expansion.

Abstract

Synthesizability remains a critical bottleneck in generative molecular design. While recent advances have addressed synthesizability in 2D graphs, extending these constraints to 3D for geometry-based conditional generation remains largely unexplored. In this work, we present SynCoGen (Synthesizable Co-Generation), a single framework that combines simultaneous masked graph diffusion and flow matching for synthesizable 3D molecule generation. SynCoGen samples from the joint distribution of molecular building blocks, chemical reactions, and atomic coordinates. To train the model, we curated SynSpace, a dataset family containing over 1.2M synthesis-aware building block graphs and 7.5M conformers. SynCoGen achieves state-of-the-art performance in unconditional small molecule graph and conformer co-generation. For protein ligand generation in drug discovery, the amortized model delivers superior performance in both molecular linker design and pharmacophore-conditioned generation across diverse targets without relying on any scoring functions. Overall, this multimodal non-autoregressive formulation represents a foundation for a range of molecular design applications, including analog expansion, lead optimization, and direct de novo design.
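
To make the idea of combining masked graph diffusion with flow matching concrete, the following is a minimal sketch of how one training example might be corrupted in both modalities at a shared noise level. It is an illustration under stated assumptions, not the paper's implementation: the names `MASK`, `corrupt_tokens`, `interpolate_coords`, and `corrupt_jointly` are hypothetical, the linear masking schedule and the straight-line coordinate path are assumed, and a flat token list stands in for the building-block and reaction graph.

```python
import random

MASK = "<mask>"  # hypothetical mask token for building-block / reaction labels


def corrupt_tokens(tokens, s, rng):
    """Masked discrete corruption: mask each token independently with probability s.

    Assumes a linear schedule where s=0 is clean and s=1 is fully masked.
    """
    return [MASK if rng.random() < s else tok for tok in tokens]


def interpolate_coords(data, noise, s):
    """Straight-line path between data and noise: x_s = (1 - s) * data + s * noise."""
    return [
        tuple((1 - s) * d + s * n for d, n in zip(p_data, p_noise))
        for p_data, p_noise in zip(data, noise)
    ]


def corrupt_jointly(tokens, coords, s, rng):
    """Corrupt discrete labels and 3D coordinates at the same noise level s."""
    # Gaussian noise endpoint for every atom (assumed prior, unit variance).
    noise = [tuple(rng.gauss(0.0, 1.0) for _ in point) for point in coords]
    return corrupt_tokens(tokens, s, rng), interpolate_coords(coords, noise, s)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["block_A", "amide_coupling", "block_B"]  # hypothetical labels
    coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    noisy_tokens, noisy_coords = corrupt_jointly(tokens, coords, 0.5, rng)
    print(noisy_tokens)
    print(noisy_coords)
```

A denoiser trained on such pairs would predict the masked labels and a coordinate update together, which is the sense in which sampling is joint rather than sequential.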

Paper Structure

This paper contains 81 sections, 26 equations, 20 figures, 10 tables, and 5 algorithms.

Figures (20)

  • Figure 1: SynCoGen is a simultaneous masked graph diffusion and flow matching model that generates synthesizable molecules in 3D coordinate space. Each node corresponds to a building block, and edges encode chemical reactions. Note that graphs are not necessarily path graphs, leaving groups are not displayed, and nodes and edges are denoised in no fixed order.
  • Figure 2: Overview of SynSpace creation process. Highly synthesizable molecules are procedurally constructed by iteratively sampling synthesis pathways from a set of building blocks and reactions. Starting from an initial block, the procedure selects a reaction center, a compatible reaction, and a suitable reactant. After the final structure is assembled, multiple low-energy 3D conformations are generated. We provide two SynSpace datasets from two vocabularies, a practically focused core set and an extended variant; each dataset contains 600k graphs with 3-4M conformers.
  • Figure 3: Conformer geometry and energy distribution. Distributions of a) bond lengths, b-c) dihedral angles, d) average per-atom GFN-FF non-bonded interaction energies. Solid curves denote training data densities; lower subpanels in (a-c) show deviations between generated samples and data.
  • Figure 4: Molecular inpainting. a) Fragment linking with three ligands in the PDB that contain substructure matches with our building blocks. For each structure, we show three examples of linkers generated by SynCoGen and the distribution of Vina docking scores (lower is better). b) Proposed synthesis pathway for molecule (1) sampled from our model and c) structure of (1) (blue) docked onto PDB 7N7X using AlphaFold3 compared against the PDB ligand (beige).
  • Figure 5: Pharmacophore-conditioned generation. Top: Docking score comparison on 10 targets from the PDB/LIT-PCBA benchmark (lower is better). Inset: target wins by method, where SynCoGen achieves the best docking score on 8/10 targets (best sample) and 7/10 (median). Bottom left: Aggregated conditional generation metrics for all 10 targets. Bottom right: Docked SynCoGen-generated molecules (green) overlaid with PDB ligand (magenta) for 5L2M, 5FV7 and 3ZME.
  • ...and 15 more figures
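
The construction loop described in the Figure 2 caption (pick a reaction center, then a compatible reaction, then a suitable reactant) can be sketched as follows. This is a toy illustration only: the vocabularies `BLOCKS` and `REACTIONS`, the block names, and the functions `partner_group` and `sample_graph` are hypothetical and are not taken from SynSpace. The final conformer-generation step is omitted.

```python
import random

# Hypothetical toy vocabulary: each building block lists its reactive groups.
BLOCKS = {
    "block_A": ["amine", "boronic_acid"],
    "block_B": ["carboxylic_acid"],
    "block_C": ["aryl_halide", "amine"],
    "block_D": ["carboxylic_acid", "aryl_halide"],
}

# Hypothetical reactions: the pair of functional groups each one couples.
REACTIONS = {
    "amide_coupling": ("amine", "carboxylic_acid"),
    "suzuki_coupling": ("boronic_acid", "aryl_halide"),
}


def partner_group(reaction, group):
    """Return the group that `reaction` pairs with `group`, or None if incompatible."""
    first, second = REACTIONS[reaction]
    if group == first:
        return second
    if group == second:
        return first
    return None


def sample_graph(max_blocks, rng):
    """Grow a building-block graph: nodes are blocks, edges are labeled by reactions."""
    nodes = [rng.choice(sorted(BLOCKS))]          # initial block
    edges = []                                    # (parent index, child index, reaction)
    open_centers = [(0, g) for g in BLOCKS[nodes[0]]]
    while len(nodes) < max_blocks and open_centers:
        # 1. Select a reaction center on the partially built molecule.
        idx, group = open_centers.pop(rng.randrange(len(open_centers)))
        # 2. Enumerate compatible reactions and 3. suitable reactants.
        options = []
        for rxn in sorted(REACTIONS):
            needed = partner_group(rxn, group)
            if needed is None:
                continue
            for name in sorted(BLOCKS):
                if needed in BLOCKS[name]:
                    options.append((rxn, name, needed))
        if not options:
            continue  # this center cannot react within the toy vocabulary
        rxn, name, used = rng.choice(options)
        new_idx = len(nodes)
        nodes.append(name)
        edges.append((idx, new_idx, rxn))
        # The consumed group is no longer available; the rest become new centers.
        remaining = list(BLOCKS[name])
        remaining.remove(used)
        open_centers.extend((new_idx, g) for g in remaining)
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = sample_graph(4, random.Random(0))
    print(nodes)
    print(edges)
```

Because any open center can be chosen at each step, the resulting graphs are trees that may branch, consistent with the Figure 1 note that graphs are not necessarily path graphs.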