Domain-Agnostic Molecular Generation with Chemical Feedback

Yin Fang, Ningyu Zhang, Zhuo Chen, Lingbing Guo, Xiaohui Fan, Huajun Chen

TL;DR

This work introduces MolGen, a SELFIES-based pre-trained molecular language model designed for efficient, domain-agnostic molecule generation. It couples a two-stage pre-training scheme with domain-agnostic molecular prefixes and a chemical feedback paradigm that aligns generation with chemical preferences through a rank-based objective, mitigating molecular hallucinations. Empirical results show MolGen excels at distribution learning for synthetic and natural product molecules and achieves notable property optimization (p-logP, QED) and docking improvements, while revealing meaningful substructure focus via attention analysis. The approach offers a scalable, data-efficient path for cross-domain molecular design and opens avenues for extensions to retrosynthesis, reaction prediction, and multimodal molecular understanding.

Abstract

The generation of molecules with desired properties has become increasingly popular, revolutionizing the way scientists design molecular structures and providing valuable support for chemical and drug design. However, despite the potential of language models in molecule generation, they face challenges such as generating syntactically or chemically flawed molecules, having narrow domain focus, and struggling to create diverse and feasible molecules due to limited annotated data or external molecular databases. To tackle these challenges, we introduce MolGen, a pre-trained molecular language model tailored specifically for molecule generation. Through the reconstruction of over 100 million molecular SELFIES, MolGen internalizes structural and grammatical insights. This is further enhanced by domain-agnostic molecular prefix tuning, fostering robust knowledge transfer across diverse domains. Importantly, our chemical feedback paradigm steers the model away from molecular hallucinations, ensuring alignment between the model's estimated probabilities and real-world chemical preferences. Extensive experiments on well-known benchmarks underscore MolGen's optimization capabilities in properties such as penalized logP, QED, and molecular docking. Additional analyses confirm its proficiency in accurately capturing molecule distributions, discerning intricate structural patterns, and efficiently exploring the chemical space. Code is available at https://github.com/zjunlp/MolGen.
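
The rank-based alignment mentioned above can be illustrated with a small sketch. The exact objective is not given in this summary, so the following is only one common form of pairwise ranking loss, written under stated assumptions: each candidate molecule has a model log-probability and a property score (for example QED), and the loss penalizes any pair in which the lower-scoring candidate is not less likely than the higher-scoring one by at least a rank-dependent margin. The function name `pairwise_rank_loss`, the `margin` parameter, and the margin schedule are hypothetical and are not taken from the MolGen code.

```python
from typing import Sequence


def pairwise_rank_loss(
    log_probs: Sequence[float],
    scores: Sequence[float],
    margin: float = 0.1,
) -> float:
    """Hinge-style ranking loss over candidate molecules (illustrative only).

    log_probs[k] is the model's log-probability for candidate k.
    scores[k] is its property score (higher is better).
    """
    if len(log_probs) != len(scores):
        raise ValueError("log_probs and scores must have the same length")

    # Order candidate indices from best to worst property score.
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)

    loss = 0.0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            better, worse = order[a], order[b]
            # Assumed schedule: the required gap grows with the rank distance.
            gap = margin * (b - a)
            # Penalize when the worse candidate is not sufficiently less likely.
            loss += max(0.0, log_probs[worse] - log_probs[better] + gap)
    return loss


# Probabilities already agree with the property ranking: no penalty.
aligned = pairwise_rank_loss([-1.0, -2.0, -3.0], [0.9, 0.5, 0.1])

# Probabilities invert the property ranking: every pair is penalized.
misaligned = pairwise_rank_loss([-3.0, -2.0, -1.0], [0.9, 0.5, 0.1])
```

In this sketch a loss of zero means the model's likelihood ordering already matches the property ordering with the required margins, which is the sense in which estimated probabilities are aligned with chemical preferences.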

Paper Structure (35 sections, 8 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: MolGen excels at generating chemically valid molecules with expected efficacy in both synthetic and natural product domains.
  • Figure 1: Comparison of visualizations of training and generated molecules.
  • Figure 2: Overview of MolGen: pre-training (left) and downstream (right) stages.
  • Figure 2: Property variations across different MolGen configurations.
  • Figure 3: Random double mutations of SMILES and SELFIES derived from the same molecule, with blue markings indicating mutation locations. The likelihood of retaining a valid SMILES after a single mutation is 9.9%. For SELFIES, it's a consistent 100% (Krenn et al., 2020).
  • ...and 10 more figures

Theorems & Definitions (1)

  • Definition 1