Hierarchical Structure-Property Alignment for Data-Efficient Molecular Generation and Editing

Ziyu Fan, Zhijian Huang, Yahan Li, Xiaowen Hu, Siyuan Shen, Yunliang Wang, Zeyu Zhong, Shuhong Liu, Shuning Yang, Shangqian Wu, Min Wu, Lei Deng

TL;DR

Experiments demonstrate that HSPAG captures fine-grained structure-property relationships and supports controllable generation under multiple property constraints. Two real-world case studies further validate its editing capabilities.

Abstract

Property-constrained molecular generation and editing are crucial in AI-driven drug discovery but remain hindered by two factors: (i) capturing the complex relationships between molecular structures and multiple properties remains challenging, and (ii) the narrow coverage and incomplete annotations of molecular properties weaken the effectiveness of property-based models. To tackle these limitations, we propose HSPAG, a data-efficient framework featuring hierarchical structure-property alignment. By treating SMILES and molecular properties as complementary modalities, the model learns their relationships at atom, substructure, and whole-molecule levels. Moreover, we select representative samples through scaffold clustering and hard samples via an auxiliary variational auto-encoder (VAE), substantially reducing the required pre-training data. In addition, we incorporate a property relevance-aware masking mechanism and diversified perturbation strategies to enhance generation quality under sparse annotations. Experiments demonstrate that HSPAG captures fine-grained structure-property relationships and supports controllable generation under multiple property constraints. Two real-world case studies further validate the editing capabilities of HSPAG.
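
The data-selection idea in the abstract (representative samples from scaffold clusters plus hard samples flagged by an auxiliary model) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the record fields `scaffold` and `vae_loss`, the rule of keeping one lowest-loss molecule per scaffold, and the `hard_fraction` parameter are all assumptions made for illustration. Scaffold keys are taken as precomputed strings, and the loss values are placeholders.

```python
from collections import defaultdict


def select_pretraining_subset(records, hard_fraction=0.1):
    """Pick representative and hard molecules from `records`.

    Each record is a dict with hypothetical keys:
      'smiles'   - the molecule as a SMILES string
      'scaffold' - a precomputed scaffold key (assumed given)
      'vae_loss' - loss from an auxiliary model (placeholder values)
    """
    # Group molecules that share a scaffold key into one cluster.
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["scaffold"]].append(rec)

    # Representative samples: one molecule per cluster.
    # Choosing the lowest-loss member is an assumption of this sketch.
    representatives = [
        min(group, key=lambda r: r["vae_loss"]) for group in clusters.values()
    ]
    chosen = {r["smiles"] for r in representatives}

    # Hard samples: the highest-loss molecules not already chosen.
    remaining = sorted(
        (r for r in records if r["smiles"] not in chosen),
        key=lambda r: r["vae_loss"],
        reverse=True,
    )
    n_hard = int(len(records) * hard_fraction)
    return representatives, remaining[:n_hard]


if __name__ == "__main__":
    toy = [
        {"smiles": "c1ccccc1C", "scaffold": "c1ccccc1", "vae_loss": 0.2},
        {"smiles": "c1ccccc1CC", "scaffold": "c1ccccc1", "vae_loss": 0.9},
        {"smiles": "C1CCCCC1O", "scaffold": "C1CCCCC1", "vae_loss": 0.5},
        {"smiles": "C1CCCCC1N", "scaffold": "C1CCCCC1", "vae_loss": 0.4},
    ]
    reps, hard = select_pretraining_subset(toy, hard_fraction=0.25)
    print([r["smiles"] for r in reps], [r["smiles"] for r in hard])
```

The two returned lists would together form the reduced pre-training set; how the paper actually clusters scaffolds and scores hardness is not specified in this summary.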

Paper Structure

This paper contains 27 sections, 7 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: The number of scaffolds under different sampling strategies.
  • Figure 2: (a) Correlation of CVAE LM loss and Availability. (b) Correlation of RMSE and CVAE LM loss. (c) Correlation of CVAE LM loss and VAE LM loss.
  • Figure 3: Distribution comparison across five molecular properties (HeavyAtomNum, QED, SA_Score, MolWt, and SMILES_Length) for different dataset splits.
  • Figure 4: Molecular generation pipeline.
  • Figure 5: Molecular property prediction pipeline.
  • ...and 9 more figures