SynCoGen: Synthesizable 3D Molecule Generation via Joint Reaction and Coordinate Modeling
Andrei Rekesh, Miruna Cretu, Dmytro Shevchuk, Vignesh Ram Somnath, Pietro Liò, Robert A. Batey, Mike Tyers, Michał Koziarski, Cheng-Hao Liu
TL;DR
SynCoGen addresses a central bottleneck in generative molecular design: producing 3D molecules that can actually be synthesized. It jointly models synthesis routes and atomic coordinates, combining discrete masked graph diffusion over building blocks and reactions with continuous flow matching over coordinates, and is trained on SynSpace, a dataset family of synthesis-aware building-block graphs and conformers. The model achieves state-of-the-art results in unconditional 3D molecule generation, supports fragment linking and pharmacophore-conditioned design in zero-shot settings, and provides an explicit synthetic route for each generated structure. These properties make it directly applicable to synthesis-aware de novo design, lead optimization, and analog expansion.
Abstract
Synthesizability remains a critical bottleneck in generative molecular design. While recent advances have addressed synthesizability in 2D graphs, extending these constraints to 3D for geometry-based conditional generation remains largely unexplored. In this work, we present SynCoGen (Synthesizable Co-Generation), a single framework that combines simultaneous masked graph diffusion and flow matching for synthesizable 3D molecule generation. SynCoGen samples from the joint distribution of molecular building blocks, chemical reactions, and atomic coordinates. To train the model, we curated SynSpace, a dataset family containing over 1.2M synthesis-aware building block graphs and 7.5M conformers. SynCoGen achieves state-of-the-art performance in unconditional small molecule graph and conformer co-generation. For protein ligand generation in drug discovery, the amortized model delivers superior performance in both molecular linker design and pharmacophore-conditioned generation across diverse targets without relying on any scoring functions. Overall, this multimodal non-autoregressive formulation represents a foundation for a range of molecular design applications, including analog expansion, lead optimization, and direct de novo design.
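
To make the pairing of masked graph diffusion and flow matching concrete, the following is a minimal illustrative sketch of how a training example could be corrupted at a shared noise level. It is not the SynCoGen implementation. The function name `corrupt_jointly`, the `[MASK]` token, the example token names, the linear masking schedule, and the convention that t = 0 is clean data and t = 1 is pure noise are all assumptions made here for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked building block or reaction


def corrupt_jointly(tokens, coords, t, rng):
    """Noise discrete tokens and 3D coordinates at a shared level t in [0, 1].

    Assumed convention: t = 0 returns the clean example, t = 1 returns
    fully masked tokens and pure Gaussian coordinates.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")

    # Discrete branch: mask each token independently with probability t
    # (a linear masking schedule, assumed for simplicity).
    noisy_tokens = [MASK if rng.random() < t else tok for tok in tokens]

    # Continuous branch: straight-line path between data and Gaussian noise.
    noise = [[rng.gauss(0.0, 1.0) for _ in xyz] for xyz in coords]
    noisy_coords = [
        [(1.0 - t) * x + t * e for x, e in zip(xyz, eps)]
        for xyz, eps in zip(coords, noise)
    ]

    # Velocity of that path with respect to t, i.e. noise minus data.
    # A flow-matching network would regress onto this target.
    velocity = [
        [e - x for x, e in zip(xyz, eps)]
        for xyz, eps in zip(coords, noise)
    ]
    return noisy_tokens, noisy_coords, velocity


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["bb_A", "rxn_1", "bb_B"]          # hypothetical graph tokens
    coords = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]  # toy atom positions
    print(corrupt_jointly(tokens, coords, 0.5, rng))
```

In a setup like this, a single network would see the partially masked graph together with the noised coordinates and be trained to recover the masked tokens and predict the coordinate velocity at once, which is one way to read "sampling from the joint distribution" of building blocks, reactions, and coordinates.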
