Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling

Yanchen Luo, Zhiyuan Liu, Yi Zhao, Sihang Li, Hengxing Cai, Kenji Kawaguchi, Tat-Seng Chua, Yang Zhang, Xiang Wang

TL;DR

The paper tackles the multi-modal challenge of 3D molecule generation by introducing UAE-3D, a unified variational auto-encoder that encodes atom types, bonds, and 3D coordinates into a single latent space. With a Diffusion Transformer (DiT) as the latent diffusion backbone, the approach eliminates the need for separate latent spaces for different modalities, which improves both training and sampling efficiency. Experiments on QM9 and GEOM-Drugs show state-of-the-art performance in de novo and conditional generation; on GEOM-Drugs, FCD drops by 72.6% relative to the previous best result, and geometric fidelity improves by over 70% on average in relative terms. The unified latent space and SE(3) augmentation strategy enable near-zero reconstruction error and let a general-purpose diffusion model without molecular inductive bias operate in latent space, which makes fast, accurate 3D molecular design more practical.

Abstract

3D molecule generation is crucial for drug discovery and material science, requiring models to process multiple modalities, including atom types, chemical bonds, and 3D coordinates. A key challenge is integrating these modalities of different shapes while maintaining SE(3) equivariance for 3D coordinates. To achieve this, existing approaches typically maintain separate latent spaces for invariant and equivariant modalities, reducing efficiency in both training and sampling. In this work, we propose Unified Variational Auto-Encoder for 3D Molecular Latent Diffusion Modeling (UAE-3D), a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space, while maintaining near-zero reconstruction error. This unified latent space eliminates the complexities of handling multi-modality and equivariance when performing latent diffusion modeling. We demonstrate this by employing the Diffusion Transformer, a general-purpose diffusion model without any molecular inductive bias, for latent generation. Extensive experiments on the GEOM-Drugs and QM9 datasets demonstrate that our method establishes new benchmarks in both de novo and conditional 3D molecule generation, achieving leading efficiency and quality. On GEOM-Drugs, it reduces FCD by 72.6% over the previous best result, while achieving over 70% relative average improvements in geometric fidelity. Our code is released at https://github.com/lyc0930/UAE-3D/.

Paper Structure

This paper contains 20 sections, 9 equations, 8 figures, 13 tables, 2 algorithms.

Figures (8)

  • Figure 1: (a) A 3D molecule has multi-modal features. (b) Prior methods use separate latent spaces for equivariant (3D) and invariant (2D) modalities, inducing unnecessary complexity for the model architecture. (c) UAE-3D reduces this complexity by establishing a unified and near-lossless latent space that integrates all molecular modalities.
  • Figure 2: Comparing UAE-3D and UDM-3D with other methods on the QM9 dataset. (a, b) Reconstruction errors on the test set. (c, d) Training and inference time.
  • Figure 3: Overview of the UDM-3D and UAE-3D models. The UAE-3D encodes 3D molecules from molecular space into a unified latent space, integrating multi-modal features such as atom types, chemical bonds, and 3D coordinates. Utilizing this latent space, UDM-3D employs a DiT to perform generative modeling. Then, the denoised latents are decoded back into 3D molecules.
  • Figure 4: t-SNE visualizations of UAE-3D's latents under SE(3) augmentations. (a) Translations along a fixed direction. (b) Rotations about a fixed axis. (c) Sequential rotations followed by translations. Color gradients show increasing distances or angles.
  • Figure 5: Visualization of random samples generated by UDM-3D on QM9.
  • ...and 3 more figures
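
The SE(3) augmentations named in the Figure 4 caption (a translation along a fixed direction, a rotation about a fixed axis, and a rotation followed by a translation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names `rotation_matrix`, `se3_transform`, and `pairwise_distances` and the toy coordinates are assumptions made for illustration only. It shows just the geometric operation x -> R x + t and checks that interatomic distances are unchanged by it.

```python
import numpy as np


def rotation_matrix(axis, angle):
    """Rotation by `angle` radians about `axis`, via Rodrigues' formula."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    x, y, z = axis
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def se3_transform(coords, rotation=None, translation=None):
    """Apply x -> R x + t to an (N, 3) array of atom coordinates."""
    coords = np.asarray(coords, dtype=float)
    R = np.eye(3) if rotation is None else np.asarray(rotation, dtype=float)
    t = np.zeros(3) if translation is None else np.asarray(translation, dtype=float)
    return coords @ R.T + t


def pairwise_distances(coords):
    """(N, N) matrix of Euclidean distances between atoms."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


# Toy coordinates (hypothetical; not taken from QM9 or GEOM-Drugs).
coords = np.array([[0.0, 0.0, 0.0],
                   [1.1, 0.0, 0.0],
                   [-0.4, 1.0, 0.0],
                   [-0.4, -0.5, 0.9]])

# (a) translation along a fixed direction, (b) rotation about a fixed axis,
# (c) rotation followed by translation.
t = 2.0 * np.array([1.0, 0.0, 0.0])
R = rotation_matrix(axis=[0.0, 0.0, 1.0], angle=np.pi / 3)
translated = se3_transform(coords, translation=t)
rotated = se3_transform(coords, rotation=R)
augmented = se3_transform(coords, rotation=R, translation=t)

# Interatomic distances are invariant under every SE(3) transformation.
same_geometry = np.allclose(pairwise_distances(coords),
                            pairwise_distances(augmented))
```

In an augmentation setting, transformed copies such as `augmented` would be fed to the encoder alongside the original coordinates; how UAE-3D samples and uses these transformations during training is not specified in this summary.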