MetaMolGen: A Neural Graph Motif Generation Model for De Novo Molecular Design

Zimo Yan, Jie Zhang, Zheng Xie, Chang Liu, Yizhen Liu, Yiping Song

TL;DR

MetaMolGen tackles data-scarce, property-conditioned molecular design by uniting first-order meta-learning (Reptile) with Conditional Neural Processes to learn task-aware molecular distributions from few examples. A learnable feature standardization layer stabilizes training and improves generalization, while a SMILES autoregressive decoder generates valid, diverse molecules conditioned on target properties via a lightweight property projector. Across ChEMBL, QM9, ZINC, and MOSES, MetaMolGen shows superior few-shot performance, strong conditional control, and fast generation, achieving high uniqueness and favorable property alignment with improved efficiency. The work advances data-efficient, controllable molecular design and provides theoretical guarantees on convergence and generalization, with practical impact for early-stage drug and materials discovery under limited data.
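As a rough illustration of the first-order meta-learning step named above, the following is a minimal, generic Reptile sketch in pure Python. It is not the MetaMolGen implementation: the toy quadratic tasks, the function names (`inner_adapt`, `reptile_update`, `quadratic_task`, `meta_train`), and all hyperparameter values are assumptions chosen only to keep the example self-contained. It shows the two loops Reptile relies on: a few gradient steps on each task, followed by moving the shared initialization toward the adapted weights.

```python
from typing import Callable, List, Sequence

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def inner_adapt(theta: Sequence[float], grad_fn: GradFn,
                inner_lr: float, inner_steps: int) -> Vector:
    """Run a few plain gradient steps on one task, starting from theta."""
    phi = list(theta)
    for _ in range(inner_steps):
        grad = grad_fn(phi)
        phi = [p - inner_lr * g for p, g in zip(phi, grad)]
    return phi


def reptile_update(theta: Sequence[float], task_grad_fns: Sequence[GradFn],
                   inner_lr: float = 0.1, inner_steps: int = 5,
                   meta_lr: float = 0.5) -> Vector:
    """One Reptile meta-step: move theta toward the mean adapted weights."""
    mean_delta = [0.0] * len(theta)
    for grad_fn in task_grad_fns:
        phi = inner_adapt(theta, grad_fn, inner_lr, inner_steps)
        for i, (p, t) in enumerate(zip(phi, theta)):
            mean_delta[i] += (p - t) / len(task_grad_fns)
    # First-order update: no gradients are taken through the inner loop.
    return [t + meta_lr * d for t, d in zip(theta, mean_delta)]


def quadratic_task(center: Sequence[float]) -> GradFn:
    """Hypothetical toy task: loss 0.5 * ||phi - center||^2, gradient phi - center."""
    def grad(phi: Vector) -> Vector:
        return [p - c for p, c in zip(phi, center)]
    return grad


def meta_train(theta: Sequence[float], tasks: Sequence[GradFn],
               meta_steps: int = 100) -> Vector:
    """Repeat the Reptile meta-step a fixed number of times."""
    current = list(theta)
    for _ in range(meta_steps):
        current = reptile_update(current, tasks)
    return current
```

On these toy tasks, calling `meta_train([0.0, 0.0], [quadratic_task([2.0, 0.0]), quadratic_task([0.0, 2.0])])` drives the initialization toward the mean of the two task optima. In a molecular setting, each task would instead be a small property-conditioned dataset and `grad_fn` would be the gradient of the generator's loss.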

Abstract

Molecular generation plays an important role in drug discovery and materials science, especially in data-scarce scenarios where traditional generative models often struggle to achieve satisfactory conditional generalization. To address this challenge, we propose MetaMolGen, a first-order meta-learning-based molecular generator designed for few-shot and property-conditioned molecular generation. MetaMolGen standardizes the distribution of graph motifs by mapping them to a normalized latent space, and employs a lightweight autoregressive sequence model to generate SMILES sequences that faithfully reflect the underlying molecular structure. In addition, it supports conditional generation of molecules with target properties through a learnable property projector integrated into the generative process. Experimental results demonstrate that MetaMolGen consistently generates valid and diverse SMILES sequences under low-data regimes, outperforming conventional baselines. This highlights its advantage in fast adaptation and efficient conditional generation for practical molecular design.

Paper Structure

This paper contains 43 sections, 9 theorems, 81 equations, 8 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Let $L_T(\theta)$ be an $L$-smooth loss function. If the learning rate $\alpha$ satisfies $0 < \alpha \leq \frac{2}{L},$ then the sequence of losses $\{L_T(\theta_k)\}$ is monotonically decreasing, where
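The statement is cut off after "where" in this listing. Assuming the iterates follow the plain gradient step $\theta_{k+1} = \theta_k - \alpha \nabla L_T(\theta_k)$ (an assumption, since the update rule is not visible here), the claim follows from the standard descent inequality for $L$-smooth functions, sketched below:

```latex
\begin{aligned}
L_T(\theta_{k+1})
  &\le L_T(\theta_k)
     + \nabla L_T(\theta_k)^{\top}(\theta_{k+1}-\theta_k)
     + \frac{L}{2}\,\lVert \theta_{k+1}-\theta_k \rVert^{2} \\
  &= L_T(\theta_k)
     - \alpha\,\lVert \nabla L_T(\theta_k) \rVert^{2}
     + \frac{L\alpha^{2}}{2}\,\lVert \nabla L_T(\theta_k) \rVert^{2} \\
  &= L_T(\theta_k)
     - \alpha\left(1-\frac{L\alpha}{2}\right)\lVert \nabla L_T(\theta_k) \rVert^{2}.
\end{aligned}
```

For $0 < \alpha \leq \frac{2}{L}$ the factor $\alpha\left(1-\frac{L\alpha}{2}\right)$ is non-negative, so the loss never increases from one step to the next. The decrease is strict whenever $\alpha < \frac{2}{L}$ and $\nabla L_T(\theta_k) \neq 0$.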

Figures (8)

  • Figure 1: Distribution of Key Molecular Properties
  • Figure 2: Overview of the MetaMolGen training process.
  • Figure 3: Comparison of key generation metrics across different models, including validity, uniqueness, drug-likeness, and overall performance score.
  • Figure 4: Few-shot performance comparison of MetaMolGen, RNN, and MolGPT across training set sizes (1k–10k) on six key metrics: validity, diversity, drug-likeness, synthesizability, solubility, and overall score.
  • Figure 5: Representative molecules generated by the MetaMolGen model using the condition vectors of Aspirin, Tamiflu, Amoxicillin, and Chloroquine. The generated molecules exhibit high property alignment.
  • ...and 3 more figures

Theorems & Definitions (9)

  • Theorem 1: Convergence of Training
  • Theorem 2: Gradient Variance Reduction via Normalization
  • Theorem 3: Improved Conditioning and Accelerated Convergence
  • Theorem 4: Variance Reduction and Generalization Improvement
  • Theorem 5: Unbiasedness of Stochastic Gradient
  • Theorem 6: Convergence under Stochastic Updates
  • Theorem 7: Generalization and Error Bound
  • Lemma 1: Descent Lemma (Beck, 2017)
  • Lemma 2: Stability of Task Encoder (Bousquet and Elisseeff, 2002)