MolMiner: Towards Controllable, 3D-Aware, Fragment-Based Molecular Design

Raul Ortega-Ochoa, Tejs Vegge, Jes Frellsen

TL;DR

MolMiner tackles inverse molecular design by delivering a flexible, controllable generator that combines fragment-based construction with dynamic 3D geometry. It introduces an order-agnostic rollout and symmetry-aware attachment, enabling generation conditioned on up to twelve properties, with a Gaussian Mixture Model prior to sample missing conditioning values. The paper demonstrates calibrated conditional generation across most properties and competitive unconditional performance, supported by new benchmarking protocols based on Wasserstein distances and calibration plots. The work advances practical, interpretable, multi-property molecular design and has potential impact in materials discovery, drug design, and green chemistry.

Abstract

We introduce MolMiner, a fragment-based, geometry-aware, and order-agnostic autoregressive model for molecular design. MolMiner supports conditional generation of molecules over twelve properties, enabling flexible control across physicochemical and structural targets. Molecules are built via symmetry-aware fragment attachments, with 3D geometry dynamically updated during generation using forcefields. A probabilistic conditioning mechanism allows users to specify any subset of target properties while sampling the rest. MolMiner achieves calibrated conditional generation across most properties and offers competitive unconditional performance. We also propose improved benchmarking methods for both unconditional and conditional generation, including distributional comparisons via Wasserstein distance and calibration plots for property control. To our knowledge, this is the first model to unify dynamic geometry, symmetry handling, order-agnostic fragment-based generation, and high-dimensional multi-property conditioning.
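The probabilistic conditioning described above (specify some properties, sample the rest from a Gaussian Mixture Model prior) can be illustrated with a minimal sketch. It assumes a toy two-property prior with one observed and one missing property; the component parameters in `TOY_PRIOR` and the function names are hypothetical and are not taken from the paper, which conditions on up to twelve properties.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def conditional_gmm(components, observed):
    """Return [(weight, mean, var)] describing p(missing | observed).

    Each component is (weight, (mean_obs, mean_mis), (var_obs, cov, var_mis)).
    """
    unnormalized = []
    for weight, (m_obs, m_mis), (s_oo, s_om, s_mm) in components:
        # Responsibility of this component for the observed value.
        resp = weight * normal_pdf(observed, m_obs, s_oo)
        # Standard Gaussian conditioning formulas for the missing dimension.
        cond_mean = m_mis + s_om / s_oo * (observed - m_obs)
        cond_var = s_mm - s_om ** 2 / s_oo
        unnormalized.append((resp, cond_mean, cond_var))
    total = sum(r for r, _, _ in unnormalized)
    return [(r / total, m, v) for r, m, v in unnormalized]


def sample_missing(components, observed, rng):
    """Draw one value of the missing property given the observed one."""
    cond = conditional_gmm(components, observed)
    weights = [w for w, _, _ in cond]
    _, mean, var = rng.choices(cond, weights=weights, k=1)[0]
    return rng.gauss(mean, math.sqrt(var))


# Hypothetical toy prior over two properties (illustration only).
TOY_PRIOR = [
    (0.5, (0.0, 0.0), (1.0, 0.8, 1.0)),
    (0.5, (5.0, 5.0), (1.0, -0.5, 2.0)),
]

if __name__ == "__main__":
    rng = random.Random(0)
    print([round(sample_missing(TOY_PRIOR, 0.0, rng), 3) for _ in range(5)])
```

The same conditioning formulas extend to several observed and several missing properties by replacing the scalar variances with covariance blocks; the scalar case is used here only to keep the sketch short.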

Paper Structure

This paper contains 31 sections, 15 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: Schematic of MolMiner’s fragment-based rollout process. Given a partially grown molecule and a selected focal attachment site, the model predicts the next fragment and attachment configuration in an autoregressive manner. Rollouts proceed in an order-agnostic manner, with growth initiated by an auxiliary predictor that selects the starting fragment.
  • Figure 2: Calibration of predicted molecular properties. Continuous properties show predicted vs. prompted values with mean trends and $\pm1$ standard deviation bands; discrete properties are summarized as confusion matrices.
  • Figure 3: Elbow plot of the BIC and AIC scores for GMMs with 1 to 19 components. The sharp drop at K=8 (the elbow) marks a suitable number of components for this problem.
  • Figure 4: Reconstruction fidelity of one missing property given all others. Using the validation dataset, each property in turn is withheld and the model is asked to reconstruct it from the rest; the reconstructed and real distributions are then compared using q-q plots annotated with the Wasserstein (W) distance.
  • Figure 5: Training and validation curves for models trained with different numbers of conditions. Curves shown in black correspond to models trained with 3 conditions, while those in blue represent models trained with 12 conditions. Models with 12 conditions consistently outperform their 3-condition counterparts in both training and validation, demonstrating improved generalization. This performance gap aligns with expectations from the tomographic effect, where increased conditioning leads to enhanced reconstruction fidelity.
  • ...and 14 more figures
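
The abstract and the Figure 4 caption both rely on the Wasserstein distance to compare a generated or reconstructed property distribution with a reference one. As a rough illustration, the 1D Wasserstein-1 distance between two empirical samples is the area between their empirical CDFs. The function name `wasserstein_1d` is hypothetical, and this is a generic sketch, not the paper's evaluation code.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1D empirical samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical CDFs. The samples may have different sizes.
    """
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both empirical CDFs are constant on [left, right).
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


if __name__ == "__main__":
    reference = [0.0, 1.0, 2.0]
    shifted = [1.0, 2.0, 3.0]
    print(wasserstein_1d(reference, shifted))
```

Shifting every sample by a constant changes the distance by exactly that constant, which makes the value directly interpretable in the units of the property being compared.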