Lift Your Molecules: Molecular Graph Generation in Latent Euclidean Space

Mohamed Amine Ketata, Nicholas Gao, Johanna Sommer, Tom Wollschläger, Stephan Günnemann

TL;DR

This work addresses molecular graph generation by translating discrete graphs into a continuous latent Euclidean space via Synthetic Coordinate Embedding (SyCo), enabling diffusion-based, all-at-once generation with permutation and rotational equivariance. It implements EDM-SyCo, combining an EGNN-based autoencoder with an $E(3)$-equivariant diffusion model to learn a robust, invertible mapping between graphs and latent point clouds, and a decoder that reconstructs graphs from latent embeddings. The authors also introduce diffusion-guided controllable generation, including property-conditioned sampling, scaffold inpainting, and similarity-constrained optimization, all within the same latent framework. Experiments on ZINC250K and GuacaMol show state-of-the-art distribution learning and strong conditional generation and optimization performance, highlighting the practical potential of latent Euclidean generative models for accelerating drug discovery.

Abstract

We introduce a new framework for molecular graph generation with 3D molecular generative models. Our Synthetic Coordinate Embedding (SyCo) framework maps molecular graphs to Euclidean point clouds via synthetic conformer coordinates and learns the inverse map using an E(n)-Equivariant Graph Neural Network (EGNN). The induced point cloud-structured latent space is well suited for applying existing 3D molecular generative models. This approach simplifies the graph generation problem, without relying on molecular fragments or autoregressive decoding, into a point cloud generation problem followed by node and edge classification tasks. Further, we propose a novel similarity-constrained optimization scheme for 3D diffusion models based on inpainting and guidance. As a concrete implementation of our framework, we develop EDM-SyCo based on the E(3) Equivariant Diffusion Model (EDM). EDM-SyCo achieves state-of-the-art performance in distribution learning of molecular graphs, outperforming the best non-autoregressive methods by more than 30% on ZINC250K and 16% on the large-scale GuacaMol dataset while improving conditional generation by up to 3.9 times.
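
To make the phrase "point cloud generation followed by node and edge classification" concrete, here is a minimal toy sketch of the decoding side. It is not the EGNN decoder from the paper: the atom vocabulary, the per-point scores, and the distance thresholds in `classify_edge` are all hypothetical stand-ins for learned classifiers, chosen only so the example runs with the Python standard library.

```python
import math
from itertools import combinations

# Hypothetical atom vocabulary, chosen only for this illustration.
ATOM_TYPES = ["C", "N", "O"]


def classify_node(logits):
    """Node classification: pick the atom type with the highest score."""
    best = max(range(len(logits)), key=lambda k: logits[k])
    return ATOM_TYPES[best]


def classify_edge(distance):
    """Edge classification stand-in: hand-picked distance thresholds
    (assumed, in angstroms) replace a learned pairwise classifier."""
    if distance < 1.3:
        return "double"
    if distance < 1.6:
        return "single"
    return "none"


def decode(coords, node_logits):
    """Turn a point cloud (coordinates plus per-point scores) into atoms and bonds."""
    atoms = [classify_node(logits) for logits in node_logits]
    bonds = {}
    for i, j in combinations(range(len(coords)), 2):
        label = classify_edge(math.dist(coords[i], coords[j]))
        if label != "none":
            bonds[(i, j)] = label
    return atoms, bonds


# Three points on a line with made-up coordinates and scores.
coords = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (-1.45, 0.0, 0.0)]
node_logits = [(2.0, 0.1, 0.3), (0.2, 0.1, 1.5), (0.0, 1.8, 0.4)]
atoms, bonds = decode(coords, node_logits)
print(atoms, bonds)
```

Because every node and every pair is classified independently of the input ordering, reordering the points simply reorders the decoded atoms, which is the property that lets the whole graph be produced all at once rather than autoregressively.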

Paper Structure (53 sections, 2 theorems, 14 equations, 9 figures, 14 tables, 2 algorithms)

Key Result

Proposition 4.1

The marginal distribution of molecular graphs $p_{\theta, \xi}({\mathcal{G}}) = \mathbb{E}_{p_\theta({\bm{z}}_0)}\left[ p_\xi({\mathcal{G}} | {\bm{z}}_0) \right]$ defined by the EDM and the decoder is an $S_N$-invariant distribution, i.e., for any molecular graph ${\mathcal{G}} = ({\bm{h}}, {\bm{A}})$ ...
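
The statement above is cut off in this excerpt. For orientation only, $S_N$-invariance of a distribution over graphs is conventionally written as below. The permutation matrix ${\bm{P}}$ is notation assumed here rather than taken from the excerpt, and ${\bm{A}}$ is treated as an $N \times N$ matrix.

```latex
p_{\theta, \xi}({\bm{P}}{\bm{h}},\, {\bm{P}}{\bm{A}}{\bm{P}}^\top)
  = p_{\theta, \xi}({\bm{h}}, {\bm{A}})
\quad \text{for every } N \times N \text{ permutation matrix } {\bm{P}}.
```

In words, relabeling the atoms of a molecule does not change the probability the model assigns to it.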

Figures (9)

  • Figure 1: Overview of EDM-SyCo. (Training) First, the autoencoder is trained to map between molecular graphs and latent Euclidean point clouds. Then, the diffusion model is trained on the fixed latent space. (Sampling) Starting with a Gaussian sample, the diffusion model denoises it for $T$ steps to predict the clean point cloud, which is mapped to a molecular graph using the decoder.
  • Figure 2: Autoencoder architecture. The encoder maps a molecular graph to a point cloud, and the decoder learns the inverse. Both are trained jointly to minimize the reconstruction loss.
  • Figure 3: Overview of our constrained optimization procedure. Based on a noising/denoising approach, we run the first steps of the reverse diffusion process using the inpainting algorithm to add new atoms and the remaining steps using the guidance algorithm to increase the target property. The depicted molecules have QED values of 0.79 (initial) and 0.91 (optimized), with a 53% similarity.
  • Figure 4: Sample molecules generated by EDM-SyCo trained on ZINC250K.
  • Figure 5: Sample molecules generated by EDM-SyCo trained on GuacaMol.
  • ...and 4 more figures
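
The two-phase procedure described in the Figure 3 caption (inpainting for the first reverse steps, guidance for the remaining ones) can be illustrated with a toy loop. This is a sketch under strong assumptions, not the paper's algorithm: points are one-dimensional, `toy_denoise` is a hypothetical stand-in for the trained diffusion model, the property is a made-up quadratic, and `switch_step`, `scale`, and the noise schedule are illustrative parameters.

```python
import random


def toy_denoise(x):
    """Hypothetical stand-in for one reverse diffusion step: shrink toward the origin."""
    return [0.9 * v for v in x]


def inpaint(x, scaffold, noise_scale, rng):
    """Overwrite the scaffold entries with their reference values plus noise
    matched to the current step, leaving the other (new) entries free."""
    out = list(x)
    for i, ref in scaffold.items():
        out[i] = ref + noise_scale * rng.gauss(0.0, 1.0)
    return out


def guide(x, target, scale):
    """One gradient ascent step on the toy property -sum((v - target)^2)."""
    return [v + scale * (-2.0) * (v - target) for v in x]


def optimize(n_points, scaffold, total_steps, switch_step, target, scale, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    phases = []
    for t in range(total_steps, 0, -1):
        x = toy_denoise(x)
        if t > switch_step:
            # Early steps: keep the scaffold pinned while new points form around it.
            x = inpaint(x, scaffold, (t - 1) / total_steps, rng)
            phases.append("inpaint")
        else:
            # Late steps: push all points toward a higher property value.
            x = guide(x, target, scale)
            phases.append("guide")
    return x, phases


x, phases = optimize(n_points=5, scaffold={0: 1.0, 1: -0.5},
                     total_steps=10, switch_step=4, target=0.5, scale=0.1)
print(phases)
```

Where the switch falls trades off the two goals: a later switch keeps the result closer to the starting scaffold, while an earlier one leaves more steps for the guidance term to raise the property.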

Theorems & Definitions (3)

  • Proposition 4.1
  • Proposition C.1
  • proof