MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models

Hojung Jung, Rodrigo Hormazabal, Jaehyeong Jo, Youngrok Park, Kyunggeun Roh, Se-Young Yun, Sehui Han, Dae-Woong Jeong

TL;DR

MolHIT introduces a Hierarchical Discrete Diffusion Model (HDDM) with Decoupled Atom Encoding (DAE) to advance molecular-graph generation. The coarse-to-fine diffusion, combined with dedicated atom-state grouping and an EDM-amenable PN-sampler, yields near-perfect validity and state-of-the-art performance on MOSES while maintaining high structural novelty. Conditional generation and scaffold extension experiments demonstrate robust multi-property control and practical applicability. The approach closes the gap between graph-based validity and 1D sequence-based exploration, enabling end-to-end atom-level molecule generation with explicit charged and aromatic states and scalable conditioning for drug-like discovery.

Abstract

Molecular generation with diffusion models has emerged as a promising direction for AI-driven drug discovery and materials science. While graph diffusion models have been widely adopted due to the discrete nature of 2D molecular graphs, existing models suffer from low chemical validity and struggle to meet the desired properties compared to 1D modeling. In this work, we introduce MolHIT, a powerful molecular graph generation framework that overcomes long-standing performance limitations in existing methods. MolHIT is based on the Hierarchical Discrete Diffusion Model, which generalizes discrete diffusion to additional categories that encode chemical priors, and decoupled atom encoding that splits the atom types according to their chemical roles. Overall, MolHIT achieves new state-of-the-art performance on the MOSES dataset with near-perfect validity for the first time in graph diffusion, surpassing strong 1D baselines across multiple metrics. We further demonstrate strong performance in downstream tasks, including multi-property guided generation and scaffold extension.

Paper Structure

This paper contains 86 sections, 5 theorems, 37 equations, 7 figures, 13 tables, and 1 algorithm.

Key Result

Lemma 3.1

Define diffusion schedules $\alpha_t, \beta_t$ to be monotonically decreasing functions satisfying the boundary conditions $\alpha_0=\beta_0=1$ and $\alpha_1=\beta_1=0$, such that $\alpha_t \leq \beta_t$ for all $t$. We define the forward diffusion process of the hierarchical Markov chain via the tr… where $\alpha_{t|s}:=\alpha_t / \alpha_s$ and $\beta_{t|s}:= \beta_t / \beta_s$. Then, the transition ke…
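
The schedule conditions stated in Lemma 3.1 can be checked numerically. The sketch below is illustrative only. The particular schedules `alpha` and `beta` are hypothetical choices, not the paper's. The function `assumed_marginals` encodes one reading of the three-level chain (clean, mid-level, masked) that is consistent with $\alpha_t \leq \beta_t$, but the truncated statement above does not confirm it.

```python
# Hypothetical schedules, chosen only because they satisfy the stated
# conditions: both decrease from 1 at t=0 to 0 at t=1, and alpha_t <= beta_t.
def alpha(t: float) -> float:
    return (1.0 - t) ** 2


def beta(t: float) -> float:
    return 1.0 - t


def alpha_cond(t: float, s: float) -> float:
    """alpha_{t|s} := alpha_t / alpha_s, defined for 0 <= s <= t and s < 1."""
    return alpha(t) / alpha(s)


def beta_cond(t: float, s: float) -> float:
    """beta_{t|s} := beta_t / beta_s, defined for 0 <= s <= t and s < 1."""
    return beta(t) / beta(s)


def check_schedules(n: int = 100) -> bool:
    """Verify boundary conditions, monotonic decrease, and alpha_t <= beta_t on a grid."""
    grid = [i / n for i in range(n + 1)]
    boundary = (alpha(0.0) == 1.0 and beta(0.0) == 1.0
                and alpha(1.0) == 0.0 and beta(1.0) == 0.0)
    ordered = all(alpha(t) <= beta(t) for t in grid)
    decreasing = all(alpha(a) >= alpha(b) and beta(a) >= beta(b)
                     for a, b in zip(grid, grid[1:]))
    return boundary and ordered and decreasing


def assumed_marginals(t: float) -> tuple:
    """ASSUMPTION (not taken from the source): probabilities of being in the
    clean, mid-level, and masked levels at time t."""
    return (alpha(t), beta(t) - alpha(t), 1.0 - beta(t))
```

Because the conditional schedules are ratios, they compose across intermediate times: $\alpha_{t|s}\,\alpha_{s|r} = \alpha_{t|r}$, and likewise for $\beta$. This is the property that keeps multi-step and single-step transitions consistent with each other.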

Figures (7)

  • Figure 1: MolHIT achieves SOTA results on the MOSES dataset. (Top) Near-perfect validity, outperforming existing graph diffusion models. (Bottom) Pareto-optimal in the quality-novelty trade-off.
  • Figure 2: Overview of MolHIT. (a) Markov chain of the Hierarchical Discrete Diffusion Model (HDDM). Clean states ($S_0$) transition to mid-level states ($S_1$) and finally to the masked state ($S_2$). (b) Generation process of MolHIT. From the masked prior, atoms are denoised into mid-level states and then into atomic tokens in a coarse-to-fine manner. (c) Phase diagram of HDDM showing the transition probability of the forward process. (d) Decoupled atom encoding scheme, which encodes aromatic and charged atom types separately.
  • Figure 3: Existing atom encoding for molecular graphs is ill-posed. (Left) Reconstruction success rate on the MOSES dataset with the previous encoding and our decoupled atom encoding. (Right) Proportion of generated molecules containing pyrrolic nitrogen $[nH]$.
  • Figure 4: Effect of top-p sampling in MolHIT.
  • Figure 5: The ratio of generated molecules with a formal charge. MolHIT reaches the training-level proportion, while models with the previous coarse encoding (left two) barely generate charged atoms.
  • ...and 2 more figures
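
The idea behind the decoupled atom encoding in Figures 2(d), 3, and 5 can be illustrated with a toy vocabulary. This is a minimal sketch under stated assumptions: the `Atom` type, both token functions, and the token spellings are hypothetical and are not the paper's actual token set. It only shows why an element-only encoding cannot distinguish aromatic or charged atoms, while an encoding that keeps those states separate can.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """Hypothetical atom record: element symbol, aromatic flag, formal charge."""
    element: str
    aromatic: bool = False
    charge: int = 0


def coarse_token(atom: Atom) -> str:
    """Element-only encoding: aromaticity and charge are discarded."""
    return atom.element


def decoupled_token(atom: Atom) -> str:
    """Illustrative decoupled encoding: aromatic and charged states get their own tokens."""
    symbol = atom.element.lower() if atom.aromatic else atom.element
    if atom.charge == 0:
        return symbol
    sign = "+" if atom.charge > 0 else "-"
    magnitude = "" if abs(atom.charge) == 1 else str(abs(atom.charge))
    return f"{symbol}{sign}{magnitude}"


atoms = [
    Atom("N"),                  # aliphatic nitrogen
    Atom("N", aromatic=True),   # aromatic nitrogen
    Atom("N", charge=1),        # positively charged nitrogen
    Atom("O", charge=-1),       # negatively charged oxygen
]

coarse_vocab = {coarse_token(a) for a in atoms}
decoupled_vocab = {decoupled_token(a) for a in atoms}
```

Under the element-only encoding the three nitrogen variants collapse into a single token, so the original atom states cannot be recovered from the tokens alone. The decoupled variant keeps one token per state.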

Theorems & Definitions (6)

  • Lemma 3.1
  • Theorem 3.2
  • Proposition 3.1
  • proof
  • Theorem 3.2
  • Corollary 3.3