D3MES: Diffusion Transformer with multihead equivariant self-attention for 3D molecule generation

Zhejun Zhang, Yuanping Chen, Shibing Chu

TL;DR

D3MES addresses two challenges in ab initio 3D molecule generation: placing hydrogen atoms correctly and generating molecules from multiple classes simultaneously. It combines a Diffusion Transformer with multihead SE(3)-equivariant self-attention, operates on patchified latent representations, and is trained with a combined noise- and variance-based loss to produce 3D coordinates, element types, and bond connectivity. The approach reports state-of-the-art or near-state-of-the-art performance on QM9 and strong results on the larger GEOM-Drugs dataset, while patchification and multi-class handling support efficient generation. The framework is intended for early-stage, large-scale generation of diverse candidate molecules, which are then validated and screened downstream for drug design and materials discovery.

Abstract

Understanding and predicting the diverse conformational states of molecules is crucial for advancing fields such as chemistry, material science, and drug development. Despite significant progress in generative models, accurately generating complex and biologically or material-relevant molecular structures remains a major challenge. In this work, we introduce a diffusion model for three-dimensional (3D) molecule generation that combines a classifiable diffusion model, Diffusion Transformer, with multihead equivariant self-attention. This method addresses two key challenges: correctly attaching hydrogen atoms in generated molecules through learning representations of molecules after hydrogen atoms are removed; and overcoming the limitations of existing models that cannot generate molecules across multiple classes simultaneously. The experimental results demonstrate that our model not only achieves state-of-the-art performance across several key metrics but also exhibits robustness and versatility, making it highly suitable for early-stage large-scale generation processes in molecular design, followed by validation and further screening to obtain molecules with specific properties.

Paper Structure (13 sections, 11 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Data preprocessing. The molecular information is stored in three channels: the first holds the 3D coordinates of the atoms, the second holds the element information, and the third holds the bond connectivity.
  • Figure 2: Preprocessing of data into DiT blocks. Input data are transformed into new feature data by the attention mechanism, which is then combined with the original input data and passed through the patchify process to generate tokens.
  • Figure 3: Overview of the diffusion process. To generate the molecule, the process begins by initializing the noise $\epsilon_T$ and variance $\Sigma_T$, followed by iterative denoising of these variables. The goal is to generate the coordinates $x$, elemental information $h$, and bond connectivity information $b$. This is achieved by sampling from the distribution $p(\epsilon_{t-1}, \Sigma_{t-1} |\epsilon_t, \Sigma_t)$ at each iteration. During training, the model $q(\epsilon_t, \Sigma_t | x, h, b)$ at time step $t$ is used to add noise and variance to the data points $x$, $h$, $b$ to learn the denoising process.
  • Figure 4: D3MES. The DiT block with adaLN-Zero normalization, extended with an additional layer of multihead equivariant self-attention.
  • Figure 5: Random generation of molecules. The molecules above are generated on the basis of the QM9 dataset, and the molecules below are generated on the basis of the Drugs dataset.
  • ...and 1 more figure
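
The three-channel layout in Figure 1 and the patchify step in Figure 2 can be illustrated with a small sketch. This is not the authors' implementation: the square `size` x `size` grid per channel, the placement of coordinates and atomic numbers within it, the function names `encode_molecule` and `patchify`, and the example coordinates are all assumptions made for illustration. The patchify routine follows the generic non-overlapping patch split used by vision-style transformers.

```python
# Hypothetical sketch: store a molecule in three channels, then split into patch tokens.
# The grid layout below is an assumption for illustration, not the paper's exact format.

def encode_molecule(atoms, bonds, size):
    """atoms: list of (atomic_number, (x, y, z)); bonds: list of (i, j, order)."""
    if size < 3 or len(atoms) > size:
        raise ValueError("size must be >= 3 and >= number of atoms")
    coords = [[0.0] * size for _ in range(size)]  # channel 1: 3D coordinates
    elems = [[0.0] * size for _ in range(size)]   # channel 2: element information
    conn = [[0.0] * size for _ in range(size)]    # channel 3: bond connectivity
    for i, (z, xyz) in enumerate(atoms):
        coords[i][:3] = list(xyz)   # row i holds x, y, z of atom i
        elems[i][0] = float(z)      # row i holds the atomic number of atom i
    for i, j, order in bonds:
        conn[i][j] = conn[j][i] = float(order)  # symmetric adjacency with bond order
    return [coords, elems, conn]


def patchify(channels, p):
    """Split each size x size channel into non-overlapping p x p patches.

    Returns one token per patch position; each token concatenates the
    patch values of every channel (length = n_channels * p * p).
    """
    size = len(channels[0])
    if size % p != 0:
        raise ValueError("grid size must be divisible by the patch size")
    tokens = []
    for r in range(0, size, p):
        for c in range(0, size, p):
            token = [ch[r + dr][c + dc]
                     for ch in channels
                     for dr in range(p)
                     for dc in range(p)]
            tokens.append(token)
    return tokens


# Heavy atoms of an ethanol-like fragment (C, C, O); hydrogens omitted.
# Coordinates are made-up illustrative values in angstroms.
atoms = [(6, (0.0, 0.0, 0.0)), (6, (1.5, 0.0, 0.0)), (8, (2.0, 1.3, 0.0))]
bonds = [(0, 1, 1), (1, 2, 1)]

channels = encode_molecule(atoms, bonds, size=4)
tokens = patchify(channels, p=2)
print(len(tokens), len(tokens[0]))
```

With a 4 x 4 grid and a patch size of 2, the molecule becomes four tokens of twelve values each, so the sequence length seen by the transformer shrinks quadratically as the patch size grows. That is the usual efficiency argument for patchification.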