3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation

Huaisheng Zhu; Teng Xiao; Vasant G Honavar

TL;DR

3M-Diffusion tackles text-guided molecular graph generation by learning a text-molecule aligned latent space and applying a multi-modal diffusion model within that space. It first builds a text-molecule aligned variational autoencoder using encoders for graphs $E_g$ and text $E_t$, producing a shared latent $\mathbf{z}$ aligned with text representation $\mathbf{c}$ via a contrastive loss, and decodes to a molecule with a HierVAE. Then it trains a latent diffusion model conditioned on $\mathbf{c}$ to map text to molecular latent codes, enabling diverse outputs that match textual prompts while maintaining semantic fidelity. Across multiple benchmarks, it achieves higher novelty and diversity than SOTA baselines with competitive semantic similarity, and its two-stage training plus latent alignment prove essential for robust text-guided molecular generation. Overall, the approach offers a scalable, fast pathway to generate chemically valid, richly varied molecules tailored to descriptive language, with direct implications for drug discovery and materials design.
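The alignment step can be illustrated with a small sketch. The summary above only states that $\mathbf{z}$ is aligned with $\mathbf{c}$ "via a contrastive loss"; the InfoNCE form, the cosine similarity, the temperature value, and the function names below are all assumptions chosen for illustration, not the authors' implementation. Only the graph-to-text direction is shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(graph_latents, text_embeddings, temperature=0.1):
    """Assumed InfoNCE loss: pair i is the positive, all other texts are negatives."""
    n = len(graph_latents)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(graph_latents[i], text_embeddings[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the row of similarities.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matched text under a softmax.
        total += log_denominator - logits[i]
    return total / n


# Toy batch: two graph latents z and two text embeddings c (hypothetical values).
z = [[1.0, 0.0], [0.0, 1.0]]
c_aligned = [[0.9, 0.1], [0.1, 0.9]]
c_swapped = [c_aligned[1], c_aligned[0]]

loss_aligned = info_nce(z, c_aligned)
loss_swapped = info_nce(z, c_swapped)
print(loss_aligned < loss_swapped)
```

In this toy batch the loss is lower when each latent sits next to its own text embedding than when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.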

Abstract

Generating molecular structures with desired properties is a critical task with broad applications in drug discovery and materials design. We propose 3M-Diffusion, a novel multi-modal molecular graph generation method, to generate diverse, ideally novel molecular structures with desired properties. 3M-Diffusion encodes molecular graphs into a graph latent space which it then aligns with the text space learned by encoder-based LLMs from textual descriptions. It then reconstructs the molecular structure and atomic attributes based on the given text descriptions using the molecule decoder. It then learns a probabilistic mapping from the text space to the latent molecular graph space using a diffusion model. The results of our extensive experiments on several datasets demonstrate that 3M-Diffusion can generate high-quality, novel and diverse molecular graphs that semantically match the textual description provided.
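To make the "probabilistic mapping" concrete, the following is a minimal sketch of the forward noising step that a latent diffusion model is typically trained against. It assumes a standard DDPM-style formulation with a linear noise schedule and a noise-prediction objective; the abstract does not confirm these choices, and the schedule values, function names, and the zero-valued stand-in predictor are hypothetical.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(z0, alpha_bar_t, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    z_t = [scale_signal * z + scale_noise * e for z, e in zip(z0, eps)]
    return z_t, eps


def denoising_loss(predicted_eps, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, eps)) / len(eps)


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)

# z0 stands in for a molecule latent produced by the graph encoder.
z0 = [0.5, -1.0, 2.0]
z_t, eps = noise_latent(z0, alpha_bars[499], rng)

# A trained denoiser would take (z_t, t, c), with c the text representation,
# and predict eps; a zero predictor is used here purely as a placeholder.
loss = denoising_loss([0.0] * len(eps), eps)
```

At inference time such a model would start from Gaussian noise in the latent space and denoise it step by step while conditioning on the text representation, after which the molecule decoder maps the resulting latent to a graph.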

Paper Structure (30 sections, 14 equations, 10 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Molecular Structure Generation
Language-guided Molecule Generation
Preliminaries
Problem Definition
Diffusion Models
Latent Multi-Modal Diffusion for Molecules
Text-Molecule Aligned Variational Autoencoder
Multi-modal Molecule Latent Diffusion
Training and Inference
Experiments and Results
Experimental Setup
Text-guided Molecule Generation
Unconditional Molecule Generation
...and 15 more sections

Figures (10)

Figure 1: Overview of 3M-Diffusion, with a molecular graph encoder/decoder and a latent diffusion model conditioned on a prior (an aligned LLM encoder). The details of alignment in the text/graph encoders and the diffusion model are given in Section \ref{sec:method}.
Figure 2: Qualitative comparison with MolT5-large of molecules generated on ChEBI-20. Compared with the SOTA method MolT5-large, our results are more diverse and novel while preserving the semantics of the textual prompt. More results are provided in Appendix \ref{app:case}.
Figure 3: Inference time comparison for conditional molecule generation on ChEBI-20.
Figure 4: Molecules generated conditionally on input text by 3M-Diffusion on ChEBI-20.
Figure 5: Qualitative comparison with MolT5-large in terms of molecules generated on ChEBI-20. Compared with the SOTA method MolT5-large, our generated results are more diverse and novel while preserving the semantics of the textual prompt.
...and 5 more figures
