PropMolFlow: Property-Guided Molecule Generation with Geometry-Complete Flow Matching

Cheng Zeng, Jirui Jin, Connor Ambrose, George Karypis, Mark Transtrum, Ellad B. Tadmor, Richard G. Hennig, Adrian Roitberg, Stefano Martiniani, Mingjie Liu

TL;DR

PropMolFlow advances property-guided 3D molecule generation by combining property embeddings with geometry-complete SE(3)-equivariant flow matching. It jointly generates atom types, formal charges, bond orders, and coordinates through a discrete continuous-time Markov chain (CTMC) and continuous flow framework, conditioned on a target property through one of several property-embedding schemes, optionally augmented with a Gaussian expansion of the property value. The method achieves competitive in-distribution (ID) performance against diffusion baselines, samples with as few as 100 steps, and is validated through density functional theory (DFT) calculations and an out-of-distribution generation task that probes novelty and extrapolation. These findings position PropMolFlow as a property-aware generator for small molecules and a foundation for future extensions to larger datasets and multi-property conditioning, with potential integration into active learning and topology-aware design.

Abstract

Molecule generation is advancing rapidly in chemical discovery and drug design. Flow matching methods have recently set the state of the art (SOTA) in unconditional molecule generation, surpassing score-based diffusion models. However, diffusion models still lead in property-guided generation. In this work, we introduce PropMolFlow, an approach for property-guided molecule generation based on geometry-complete SE(3)-equivariant flow matching. Integrating five different property embedding methods with a Gaussian expansion of scalar properties, PropMolFlow achieves competitive performance against previous SOTA diffusion models in conditional molecule generation while maintaining high structural stability and validity. It also samples faster, requiring fewer time steps than baseline models. We highlight the importance of validating the properties of generated molecules through density functional theory (DFT) calculations. Finally, we introduce a task that tests the model's ability to propose molecules with underrepresented property values, probing its capacity for out-of-distribution generalization.

Paper Structure

This paper contains 9 sections, 16 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the PropMolFlow methodology. a, PropMolFlow models are jointly trained on a molecular graph and property embedding. A molecular graph includes node scalar features, node position features and edge bond features. A property embedding comprises an optional Gaussian expansion mapping (G.E.) followed by a multilayer perceptron (MLP) that projects a scalar property to a high-dimensional embedding space. Conditioning on the property is achieved by the interaction between the property embedding and node scalar features. b, A joint flow matching process is used to generate different molecule modalities together. c, Five interaction types between a property embedding 'P' and a molecular graph. Node scalar features $[A_t, C_t]$ include atom type $A$ and formal charge $C$ at time $t$. An MLP transformation $\varphi_\theta$ is applied to convert the dimension back to that of the original $[A_t, C_t]$, where necessary. '$\odot$' represents an element-wise Hadamard product, and '$\oplus$' indicates a 'Concatenate' operation. $E_t$ and $X_t$ represent bond edge features and node position features, respectively. d, Gaussian expansion as augmented property embedding. Curves in the top panel correspond to five Gaussian basis functions that are evenly spaced between the minimum and maximum property values. Centers of Gaussians are marked by gray dashed lines, and the red solid lines represent two example inputs. The bottom panels show the function values for each Gaussian for two inputs. In the molecular representations, white, gray, red, and blue indicate H, C, O, and N, respectively.
  • Figure 2: Chemical validity and sampling efficiency of PropMolFlow against five baseline models. a, Molecule stability. b, RDKit validity. c, Uniqueness. d, Closed-shell ratio. e, PoseBusters validity. f, Sampling time. The y-axis for sampling time uses a broken scale to expand 0--150 min and compress 150--360 min by a ratio of 5 for visual clarity. Each box plot summarizes the metric values computed for six molecular properties ($n=6$) and 10,000 sampled molecules. The median is shown as a solid line. The edges of the box correspond to the first and third quartiles, and the whiskers extend to values within 1.5$\times$ the interquartile range. All individual data points are overlaid as black dots. PropMolFlow results use the top-performing models of each property in the ID tasks for property alignment.
  • Figure 3: Performance of GVP property predictors without and with DFT relaxation. a, Comparison between Target, DFT and GVP shows the reliability of GVP in evaluating MAE metrics commonly used for property-guided generation. Comparison between GVP and GVP-R, and between DFT and DFT-R shows the structural dependence of GVP-predicted and DFT-predicted properties, respectively. Comparison between GVP and DFT with and without relaxation shows the reliability of GVP in capturing ground-truth DFT values for both raw and relaxed structures. 'Target', 'DFT' and 'GVP' denote input, DFT-calculated and GVP-predicted property values on raw molecules, respectively. 'DFT-R' and 'GVP-R' refer to values evaluated on DFT-relaxed molecules. The $d$ indicates the MAE distances between two property-value vectors. b, Pairwise comparison between Target, DFT and GVP on raw molecules. c, GVP versus DFT for both raw and DFT-relaxed molecules. d, Root mean squared distances (RMSDs) and normalized property sensitivity due to DFT relaxation. Molecules with the highest RMSDs for each property and their corresponding DFT and GVP values are shown. In molecular representations, gray, red, blue and white indicate C, O, N and H atoms, respectively. Property values for $\alpha$, $\Delta\epsilon$ and $C_v$ are in units of Bohr$^3$, eV and cal mol$^{-1}$ K$^{-1}$, respectively.
  • Figure 4: Interpolation study by varying property values. The minimum and maximum target properties (red) and the corresponding DFT-calculated properties (blue) are shown below each configuration. All molecules shown pass the filtering criteria and have DFT values closest to the target properties. In molecular representations, gray, white, red and blue indicate C, H, O and N atoms, respectively. Property units are provided in square brackets under each property symbol.
  • Figure 5: Toward out-of-distribution generation. a, Distribution of DFT-calculated and GVP-predicted property values for molecules generated by PropMolFlow; the property distribution of the QM9 training data is also shown. The vertical black dashed line in the histograms denotes the target property value $q_{0.99}$, corresponding to the 99th percentile of the training-data distribution. Curves overlaid on the histograms are kernel density estimation fits. b, Three example molecules absent from QM9 but present in a larger PubChem dataset are shown in the left panel. Numbers below the configurations indicate DFT-calculated property values for raw molecules generated by PropMolFlow. In molecular representations, gray, white, red, blue and yellow indicate C, H, O, N and F atoms, respectively. Property values for $\alpha$, $\Delta\epsilon$ and $C_v$ are in units of Bohr$^3$, eV and cal mol$^{-1}$ K$^{-1}$, respectively. c, Maximum Tanimoto similarity between generated, filtered molecules and the training data computed using Morgan fingerprints. Dashed lines indicate the 0.8 similarity cutoff used to define novel molecules.
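
The Gaussian expansion described in the Figure 1d caption (five Gaussian basis functions evenly spaced between the minimum and maximum property values) can be illustrated with a minimal NumPy sketch. The function name `gaussian_expansion`, its parameters, and in particular the choice of basis width (set here to the spacing between centers) are assumptions made for illustration; the caption does not state the width, and this is not the authors' implementation.

```python
import numpy as np


def gaussian_expansion(value, p_min, p_max, n_basis=5, width=None):
    """Expand scalar property value(s) onto evenly spaced Gaussian basis functions.

    Hypothetical helper: centers follow the Figure 1d caption (evenly spaced
    between the minimum and maximum property values); the width is an assumption.
    """
    # Centers evenly spaced between the minimum and maximum property values.
    centers = np.linspace(p_min, p_max, n_basis)
    if width is None:
        # Assumption: one shared width equal to the spacing between centers.
        width = (p_max - p_min) / (n_basis - 1)
    value = np.asarray(value, dtype=float)
    # Add a trailing axis so scalars map to (n_basis,) and batches to (batch, n_basis).
    return np.exp(-0.5 * ((value[..., None] - centers) / width) ** 2)


# Two example inputs, analogous to the two red lines in the figure.
features = gaussian_expansion([0.2, 0.9], p_min=0.0, p_max=1.0)
print(features.shape)
```

Each input becomes a vector of `n_basis` activations, with the largest value at the center nearest to the input. In the pipeline described in the caption, such a vector would then be passed to the MLP that produces the property embedding.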
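
Two quantities in the Figure 5 caption lend themselves to a short worked sketch: the target value $q_{0.99}$ (the 99th percentile of the training-data property distribution) and the novelty criterion based on maximum Tanimoto similarity with a 0.8 cutoff. The example below uses toy data and represents each fingerprint as a plain set of "on" bit indices; in practice these bits would come from Morgan fingerprints computed by a cheminformatics toolkit. The helper names, the toy values, and the strict `<` comparison at the cutoff are assumptions, not details confirmed by the source.

```python
import numpy as np


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity(query_bits, training_bits):
    """Highest Tanimoto similarity between one molecule and any training molecule."""
    return max(tanimoto(query_bits, ref) for ref in training_bits)


def is_novel(query_bits, training_bits, cutoff=0.8):
    """Assumption: a molecule counts as novel if its maximum similarity is below the cutoff."""
    return max_similarity(query_bits, training_bits) < cutoff


# Toy property values standing in for a training-set property distribution.
train_values = np.arange(1.0, 101.0)
q99 = np.quantile(train_values, 0.99)  # target value at the 99th percentile

# Toy fingerprints standing in for Morgan fingerprint bits.
training_bits = [{1, 2, 3, 4, 5}, {10, 11, 12}]
near = {1, 2, 3, 4, 5, 6}  # shares 5 of 6 bits with the first training entry
far = {1, 2, 20, 21}       # shares few bits with any training entry

print(q99, is_novel(near, training_bits), is_novel(far, training_bits))
```

Taking the maximum over the training set means a generated molecule is flagged as non-novel as soon as it closely resembles any single training molecule, which matches the "maximum Tanimoto similarity" wording of the caption.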