Chain-of-Generation: Progressive Latent Diffusion for Text-Guided Molecular Design

Lingxiao Li, Haobo Zhang, Bin Chen, Jiayu Zhou

TL;DR

This paper tackles the challenge of text-conditioned molecular design by identifying limitations of one-shot conditioning in diffusion-based generation. It introduces Chain-of-Generation (CoG), a training-free, multi-stage latent-diffusion framework that decomposes prompts into semantic segments and progressively guides generation, augmented by a post-alignment phase to strengthen cross-modal alignment. The approach yields higher semantic fidelity, controllability, and diversity, verified by benchmark results showing gains over strong baselines on curated datasets. The proposed curriculum-style, interpretable generation pipeline has practical implications for drug discovery and material design, enabling more reliable and transparent alignment between language descriptions and molecular structures.

Abstract

Text-conditioned molecular generation aims to translate natural-language descriptions into chemical structures, enabling scientists to specify functional groups, scaffolds, and physicochemical constraints without handcrafted rules. Diffusion-based models, particularly latent diffusion models (LDMs), have recently shown promise by performing stochastic search in a continuous latent space that compactly captures molecular semantics. Yet existing methods rely on one-shot conditioning, where the entire prompt is encoded once and applied throughout diffusion, making it hard to satisfy all the requirements in the prompt. We discuss three outstanding challenges of one-shot conditioning: poor interpretability of the generated components, failure to generate all substructures, and overambition in considering all requirements simultaneously. We then derive three principles to address these challenges and, motivated by them, propose Chain-of-Generation (CoG), a training-free multi-stage latent diffusion framework. CoG decomposes each prompt into curriculum-ordered semantic segments and progressively incorporates them as intermediate goals, guiding the denoising trajectory toward molecules that satisfy increasingly rich linguistic constraints. To reinforce semantic guidance, we further introduce a post-alignment learning phase that strengthens the correspondence between textual and molecular latent spaces. Extensive experiments on benchmark and real-world tasks demonstrate that CoG yields higher semantic alignment, diversity, and controllability than one-shot baselines, producing molecules that more faithfully reflect complex, compositional prompts while offering transparent insight into the generation process.
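
The progressive conditioning described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the helper names (`cumulative_prompts`, `staged_generation`, `toy_denoise`) are hypothetical, the choice to build each stage's prompt by joining all segments seen so far is an assumption, and the "denoiser" is a numeric placeholder standing in for a real latent diffusion stage.

```python
from typing import Callable, List, Sequence, Tuple

Latent = List[float]


def cumulative_prompts(segments: Sequence[str]) -> List[str]:
    """Stage k's prompt joins segments 1..k (assumed accumulation scheme)."""
    return [" ".join(segments[: k + 1]) for k in range(len(segments))]


def staged_generation(
    segments: Sequence[str],
    denoise_stage: Callable[[Latent, str], Latent],
    init_latent: Latent,
) -> Tuple[Latent, List[Tuple[str, Latent]]]:
    """Run one denoising stage per cumulative prompt, reusing the latent."""
    latent = list(init_latent)
    trace: List[Tuple[str, Latent]] = []
    for prompt in cumulative_prompts(segments):
        # Each stage starts from the previous stage's latent, so earlier
        # segments act as intermediate goals for later ones.
        latent = denoise_stage(latent, prompt)
        trace.append((prompt, list(latent)))
    return latent, trace


def toy_denoise(latent: Latent, prompt: str) -> Latent:
    """Placeholder stage: moves each coordinate halfway toward the prompt's
    word count. A real stage would run text-conditioned latent diffusion."""
    target = float(len(prompt.split()))
    return [x + 0.5 * (target - x) for x in latent]


# Hypothetical segmentation of a prompt in the spirit of Figure 2.
segments = [
    "The molecule is a pyridine ring",
    "with a nitro substituent",
    "and fluorine atoms.",
]
final, trace = staged_generation(segments, toy_denoise, [0.0, 0.0])
for prompt, latent in trace:
    print(f"{prompt!r} -> {latent}")
```

The per-stage trace is what makes such a pipeline inspectable: each intermediate latent can be decoded and checked against the segment that was just added.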

Paper Structure

This paper contains 22 sections, 5 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overview of the proposed chain-of-generation (CoG). Left: Existing diffusion approaches (e.g., zhu20243m), where a text prompt $T$ is projected to a vector in the text-embedding space that guides diffusion in the molecule-embedding space for generation. Right: The proposed CoG approach first segments the text prompt into a set of components ($T_1, T_2, T_3$) and progressively includes the components into smaller prompts to guide multi-stage diffusion for molecule generation.
  • Figure 2: Generated molecules for three textual prompts. Left: "The molecule is made of a pyridine ring." Middle: "The molecule is a pyridine ring with a nitro substituent." Right: "The molecule is made of a pyridine ring substituted with both a nitro group and fluorine atoms." Note that as the complexity of the prompt increases, one-shot conditioning may not be able to satisfy all the requirements. Moreover, latent diffusion ensembles each component progressively during generation.
  • Figure 3: Two examples where nearly identical molecular graphs are represented by drastically different SMILES strings. In such cases, NLP-based string metrics (e.g., Levenshtein) often assign a high distance score, failing to reflect the true structural correspondence. In contrast, our graph-based evaluation correctly captures scaffold and functional group similarity.
  • Figure 4: Segmentation results across different LLMs
  • Figure 5: Distribution of number of information pieces after reasoning segmentation on ChEBI-20
  • ...and 2 more figures
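
The point made in the Figure 3 caption, that string metrics can penalize SMILES strings describing the same structure, can be illustrated with a small sketch. The SMILES pairs below are illustrative choices rather than the paper's examples: `c1ccncc1` and `n1ccccc1` both denote pyridine, and `CCO` and `OCC` both denote ethanol. The `levenshtein` function is a standard dynamic-programming edit distance written here for illustration; the paper's graph-based evaluation is not reproduced.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


# Pairs of SMILES strings that encode the same molecule (illustrative).
same_molecule_pairs = {
    "pyridine": ("c1ccncc1", "n1ccccc1"),
    "ethanol": ("CCO", "OCC"),
}

for name, (s1, s2) in same_molecule_pairs.items():
    d = levenshtein(s1, s2)
    print(f"{name}: {s1} vs {s2} -> edit distance {d}")
```

Both pairs get a nonzero edit distance even though the underlying molecular graphs are identical, which is the mismatch a graph-based comparison is meant to avoid.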