Lift Your Molecules: Molecular Graph Generation in Latent Euclidean Space
Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann
TL;DR
This work addresses molecular graph generation by translating discrete graphs into a continuous latent Euclidean space via Synthetic Coordinate Embedding (SyCo), enabling diffusion-based, all-at-once generation with permutation and rotational equivariance. It implements EDM-SyCo, combining an EGNN-based autoencoder with an $E(3)$-equivariant diffusion model to learn a robust, invertible mapping between graphs and latent point clouds, and a decoder that reconstructs graphs from latent embeddings. The authors also introduce diffusion-guided controllable generation, including property-conditioned sampling, scaffold inpainting, and similarity-constrained optimization, all within the same latent framework. Experiments on ZINC250K and GuacaMol show state-of-the-art distribution learning and strong conditional generation and optimization performance, highlighting the practical potential of latent Euclidean generative models for accelerating drug discovery.
Abstract
We introduce a new framework for molecular graph generation with 3D molecular generative models. Our Synthetic Coordinate Embedding (SyCo) framework maps molecular graphs to Euclidean point clouds via synthetic conformer coordinates and learns the inverse map using an E(n)-Equivariant Graph Neural Network (EGNN). The induced point cloud-structured latent space is well suited to existing 3D molecular generative models. This approach reduces the graph generation problem, without relying on molecular fragments or autoregressive decoding, to a point cloud generation problem followed by node and edge classification tasks. Further, we propose a novel similarity-constrained optimization scheme for 3D diffusion models based on inpainting and guidance. As a concrete implementation of our framework, we develop EDM-SyCo based on the E(3) Equivariant Diffusion Model (EDM). EDM-SyCo achieves state-of-the-art performance in distribution learning of molecular graphs, outperforming the best non-autoregressive methods by more than 30% on ZINC250K and 16% on the large-scale GuacaMol dataset while improving conditional generation by up to 3.9 times.
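Decoding a generated point cloud into a graph by node and edge classification only makes sense if the predicted edges do not depend on how the cloud is oriented in space or in what order its points are listed. The sketch below illustrates that property in plain Python. It is not the authors' EGNN decoder: `toy_edge_classifier`, its distance `cutoff`, and the example coordinates are hypothetical stand-ins chosen for illustration, with a fixed distance threshold taking the place of the learned classifier.

```python
import math

def toy_edge_classifier(points, cutoff=1.7):
    """Hypothetical stand-in for a learned edge classifier:
    predict an edge between i and j iff their distance is below `cutoff`."""
    n = len(points)
    return [
        [int(i != j and math.dist(points[i], points[j]) < cutoff) for j in range(n)]
        for i in range(n)
    ]

def rotate_z_and_translate(points, angle, shift):
    """Apply a rigid motion: rotation about the z-axis, then a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]

def permute(points, perm):
    """Reorder the points so that new point i is old point perm[i]."""
    return [points[p] for p in perm]

# Illustrative latent point cloud: four points forming a chain (made-up values).
cloud = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.3, 0.0), (3.7, 1.3, 0.4)]
adj = toy_edge_classifier(cloud)

# Invariance: a rigid motion of the cloud leaves the predicted edges unchanged.
moved = rotate_z_and_translate(cloud, angle=0.9, shift=(4.0, -2.0, 1.0))
adj_moved = toy_edge_classifier(moved)

# Equivariance: reordering the points reorders rows and columns of the adjacency.
perm = [2, 0, 3, 1]
adj_perm = toy_edge_classifier(permute(cloud, perm))
adj_expected = [[adj[perm[i]][perm[j]] for j in range(4)] for i in range(4)]

print(adj == adj_moved, adj_perm == adj_expected)
```

Because the toy rule depends only on pairwise distances, both checks hold by construction. A learned equivariant network is a far more expressive way to obtain the same symmetry behaviour, which is the role the abstract assigns to the EGNN.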
