3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure Generation
Huaisheng Zhu, Teng Xiao, Vasant G Honavar
TL;DR
3M-Diffusion tackles text-guided molecular graph generation by learning a latent space in which text and molecules are aligned and then running a multi-modal diffusion model within that space. It first trains a text-molecule aligned variational autoencoder: a graph encoder $E_g$ and a text encoder $E_t$ produce a molecular latent $\mathbf{z}$ that is aligned with the text representation $\mathbf{c}$ via a contrastive loss, and a HierVAE decoder maps $\mathbf{z}$ back to a molecule. It then trains a latent diffusion model conditioned on $\mathbf{c}$ to map text to molecular latent codes, so that a single prompt can yield diverse molecules that still match the description. Across multiple benchmarks, it achieves higher novelty and diversity than state-of-the-art baselines with competitive semantic similarity, and both the two-stage training and the latent alignment prove essential for robust text-guided molecular generation. Overall, the approach offers a scalable, fast way to generate chemically valid and varied molecules from descriptive language, with direct implications for drug discovery and materials design.
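The alignment step can be illustrated with a minimal sketch. The summary above only says that $\mathbf{z}$ and $\mathbf{c}$ are aligned "via a contrastive loss"; the symmetric InfoNCE form below, the function name `info_nce`, and the `temperature` value are assumptions chosen for illustration, not the authors' exact objective. Random arrays stand in for the outputs of $E_g$ and $E_t$.

```python
import numpy as np

def info_nce(z, c, temperature=0.1):
    """Symmetric contrastive loss (assumed InfoNCE form, hypothetical helper).

    z: molecule latents, shape (batch, dim), standing in for E_g outputs.
    c: text embeddings, shape (batch, dim), standing in for E_t outputs.
    Row i of z is assumed to be paired with row i of c.
    """
    # Normalize so the dot product is a cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    logits = z @ c.T / temperature  # (batch, batch) similarity matrix

    def diag_cross_entropy(l):
        # Numerically stable log-softmax over each row; the matching
        # pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the molecule-to-text and text-to-molecule directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

# Usage: matched pairs should score a lower loss than mismatched ones.
rng = np.random.default_rng(0)
z_demo = rng.standard_normal((8, 16))
c_matched = z_demo + 0.01 * rng.standard_normal((8, 16))
c_mismatched = np.roll(c_matched, 1, axis=0)
print(info_nce(z_demo, c_matched), info_nce(z_demo, c_mismatched))
```

Minimizing a loss of this kind pulls each molecule latent toward its own description and pushes it away from the other descriptions in the batch, which is the property the conditional diffusion stage relies on.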
Abstract
Generating molecular structures with desired properties is a critical task with broad applications in drug discovery and materials design. We propose 3M-Diffusion, a novel multi-modal molecular graph generation method, to generate diverse, ideally novel molecular structures with desired properties. 3M-Diffusion encodes molecular graphs into a graph latent space, which it aligns with the text space learned by encoder-based LLMs from textual descriptions. A molecule decoder then reconstructs the molecular structure and atomic attributes based on the given text descriptions. Finally, a diffusion model learns a probabilistic mapping from the text space to the latent molecular graph space. The results of our extensive experiments on several datasets demonstrate that 3M-Diffusion can generate high-quality, novel and diverse molecular graphs that semantically match the textual description provided.
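To make the "probabilistic mapping from the text space to the latent molecular graph space" concrete, the following is a minimal sketch of one training step of a text-conditioned latent diffusion model. It assumes a standard DDPM-style formulation with a linear noise schedule and a noise-prediction loss; the abstract does not specify these details, and the names `make_alpha_bar`, `q_sample`, `denoising_loss`, and the `denoiser` callable are hypothetical placeholders rather than the authors' implementation.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, eps):
    """Forward noising: z_t = sqrt(a_t) * z_0 + sqrt(1 - a_t) * eps."""
    a = alpha_bar[t][:, None]  # one coefficient per example in the batch
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def denoising_loss(denoiser, z0, c, alpha_bar, rng):
    """Mean squared error between the true and the predicted noise.

    denoiser: hypothetical callable (z_t, t, c) -> predicted noise.
    z0: clean molecule latents from the (frozen) graph encoder.
    c:  text representations used as the conditioning signal.
    """
    t = rng.integers(0, len(alpha_bar), size=z0.shape[0])
    eps = rng.standard_normal(z0.shape)
    zt = q_sample(z0, t, alpha_bar, eps)
    eps_hat = denoiser(zt, t, c)
    return np.mean((eps - eps_hat) ** 2)

# Usage with a placeholder denoiser that always predicts zero noise.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0_demo = rng.standard_normal((4, 8))
c_demo = rng.standard_normal((4, 8))
zero_denoiser = lambda zt, t, c: np.zeros_like(zt)
print(denoising_loss(zero_denoiser, z0_demo, c_demo, alpha_bar, rng))
```

At generation time, a model trained this way would start from Gaussian noise, denoise it step by step under the condition $\mathbf{c}$, and pass the resulting latent to the molecule decoder. Because sampling is stochastic, the same text can yield different latents, which is one way the diversity described above can arise.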
