Multi-granularity Score-based Generative Framework Enables Efficient Inverse Design of Complex Organics

Zijun Chen, Yu Wang, Liuzhenghao Lv, Hao Li, Zongying Lin, Li Yuan, Yonghong Tian

TL;DR

OrgMol-Design tackles inverse design of complex organics by combining a fragment-prior score-based generator for coarse-grained scaffolds with a chemistry-informed fine-grained bond scorer. It models generation over a fragment graph $\mathbf{G}^{\mathcal{F}}=(\mathbf{F},\mathbf{C})$ using two score networks $\boldsymbol{\epsilon}_{\theta,t}$ and $\boldsymbol{\epsilon}_{\phi,t}$ to estimate node and topology scores across time steps $t \in [0,T]$, and then refines assembled structures via a bond-scoring module that enforces chemical validity. A learned fragment vocabulary built with a Byte Pair Encoding-style bottom-up merge reduces atomic complexity and preserves essential substructures. Across four challenging benchmarks (OPVs, reaction substrates, organic emitters, and protein ligands), OrgMol-Design achieves state-of-the-art results and substantial efficiency gains over atom-level diffusion baselines, underscoring the value of fragment priors for scalable, high-quality inverse design of complex organics.

Abstract

Efficiently searching an enormous chemical library to design targeted molecules is crucial for accelerating drug discovery, organic chemistry, and optoelectronic materials research. Although generative models can now produce novel drug-like molecules, more realistic scenarios remain difficult: complex functional groups (e.g., pyrene, acenaphthylene, and bridged-ring systems) and extensive molecular scaffolds are challenging obstacles for the generation of complex organics. Traditionally, the former demands an extra learning process, e.g., molecular pre-training, and the latter requires expensive computational resources. To address these challenges, we propose OrgMol-Design, a multi-granularity framework for efficiently designing complex organics. OrgMol-Design is composed of a score-based generative model with a fragment prior for diverse coarse-grained scaffold generation and a chemical-rule-aware scoring model for fine-grained molecular structure design, circumventing the difficulty of learning intricate substructures without losing connection details among fragments. Our approach achieves state-of-the-art performance on four more challenging real-world benchmarks covering broader scientific domains, outperforming advanced molecule generative models. Additionally, it delivers a substantial speedup and graphics memory reduction compared to diffusion-based graph models. Our results also demonstrate the importance of leveraging a fragment prior for a generalized molecule inverse design model.

Paper Structure

This paper contains 42 sections, 16 equations, 12 figures, 8 tables, and 2 algorithms.

Figures (12)

  • Figure 1: The motivation of our fragment-based diffusion framework. Whereas atom-level diffusion models incur extra learning cost and structure-by-structure models cannot maintain permutation invariance, the fragment-based diffusion framework overcomes both challenges.
  • Figure 2: An overview of OrgMol-Design. (Left) Coarse-grained fragment generation. Randomly connected fragments are sampled from the prior distribution at $t=T$. Colored trajectories represent different diffusion processes in the joint space of fragment features and connections. (Middle) Generated fragments and connections at $t=0$. (Right) Fine-grained bond scoring. The highest-scoring connection is selected, and the molecule is completed after a chemical-rule check.
  • Figure 3: Examples of generated molecules corresponding to each metric value.
  • Figure 4: Time and graphics memory cost comparison between OrgMol-Design and GDSS during (a) diffusion training and (b) sampling to generate molecules.
  • Figure 5: An example of fragment vocabulary construction on a given training set {CCC=O, CC=CC, COC=O}. (a) The vocabulary is initialized by single atoms. (b) In the first iteration, the fragment CC (highlighted in blue) emerges as the most frequent and is subsequently added to the vocabulary. All occurrences of CC are then merged to update the molecular graphs. (c) In the second iteration, CO (highlighted in yellow) is the most frequent, leading to its addition to the vocabulary and subsequent merging. After two iterations, the vocabulary is constructed as {C, O, CC, CO}.
  • ...and 7 more figures
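
The vocabulary construction described in the Figure 5 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hand-written toy graphs, the function names (`label`, `neighbor_pairs`, `build_vocab`), the alphabetical tie-breaking, and above all the choice to label a fragment by its sorted element symbols while ignoring bond orders are assumptions made for brevity. A real pipeline would more likely use a cheminformatics toolkit and canonical SMILES as fragment labels.

```python
from collections import Counter

# Toy graphs written by hand for the SMILES CCC=O, CC=CC, COC=O.
# Each molecule is (atom symbols, bonds as index pairs).
# ASSUMPTION: bond orders are dropped to keep the sketch short.
MOLECULES = [
    (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]),
    (["C", "C", "C", "C"], [(0, 1), (1, 2), (2, 3)]),
    (["C", "O", "C", "O"], [(0, 1), (1, 2), (2, 3)]),
]


def label(atoms, members):
    """Hypothetical fragment label: the sorted element symbols of its atoms."""
    return "".join(sorted(atoms[i] for i in members))


def neighbor_pairs(frags, bonds):
    """Index pairs of distinct fragments joined by at least one bond."""
    owner = {a: k for k, frag in enumerate(frags) for a in frag}
    pairs = {
        tuple(sorted((owner[i], owner[j])))
        for i, j in bonds
        if owner[i] != owner[j]
    }
    return sorted(pairs)


def build_vocab(molecules, n_merges):
    """BPE-style bottom-up merge: repeatedly add the most frequent
    merged-neighbor fragment to the vocabulary and merge its occurrences."""
    # Every molecule starts as one single-atom fragment per atom.
    state = [[frozenset([i]) for i in range(len(atoms))] for atoms, _ in molecules]
    vocab = sorted({a for atoms, _ in molecules for a in atoms})
    for _ in range(n_merges):
        # Count the label of every pair of neighboring fragments.
        counts = Counter()
        for (atoms, bonds), frags in zip(molecules, state):
            for p, q in neighbor_pairs(frags, bonds):
                counts[label(atoms, frags[p] | frags[q])] += 1
        if not counts:
            break
        # ASSUMPTION: ties are broken alphabetically.
        best = max(sorted(counts), key=counts.get)
        vocab.append(best)
        # Greedily merge occurrences of the chosen fragment in each molecule.
        for m, (atoms, bonds) in enumerate(molecules):
            frags = state[m]
            merged = True
            while merged:
                merged = False
                for p, q in neighbor_pairs(frags, bonds):
                    if label(atoms, frags[p] | frags[q]) == best:
                        new = frags[p] | frags[q]
                        frags = [f for k, f in enumerate(frags) if k not in (p, q)]
                        frags.append(new)
                        merged = True
                        break
            state[m] = frags
    return vocab


print(build_vocab(MOLECULES, 2))
```

Under these simplifications, two merge iterations on the toy training set give the same vocabulary as the caption, {C, O, CC, CO}. Because bond orders are ignored here, C=O and C-O share one label; labels that encode bond types would change the counts and could change which fragment wins an iteration.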