Diffusion Models at the Drug Discovery Frontier: A Review on Generating Small Molecules versus Therapeutic Peptides

Yiquan Wang, Yahui Ma, Yuhan Chang, Jiayao Yan, Jialin Zhang, Minnuo Cai, Kai Wei

TL;DR

This review analyzes how diffusion models are transforming drug discovery by enabling de novo design of both small molecules and therapeutic peptides. It contrasts modality-specific representations, benchmarks, and design objectives, highlighting representative systems such as Pocket2Mol, DiffSBDD, TargetDiff, RFdiffusion, and ProteinMPNN. The authors underscore data scarcity, the unreliability of scoring functions, and the necessity of experimental validation, arguing for integrated, automated DBTL pipelines to realize on-demand therapeutic design. The work articulates practical implications for accelerating discovery while outlining key challenges and opportunities to bridge computational designs with real-world synthesis and biology. Overall, diffusion models offer a promising path to shift from broad chemical exploration to targeted, efficient engineering of novel therapeutics within an automated discovery framework.

Abstract

Diffusion models have emerged as a leading framework in generative modeling, poised to transform the traditionally slow and costly process of drug discovery. This review provides a systematic comparison of their application in designing two principal therapeutic modalities: small molecules and therapeutic peptides. We dissect how the unified framework of iterative denoising is adapted to the distinct molecular representations, chemical spaces, and design objectives of each modality. For small molecules, these models excel at structure-based design, generating novel, pocket-fitting ligands with desired physicochemical properties, yet face the critical hurdle of ensuring chemical synthesizability. Conversely, for therapeutic peptides, the focus shifts to generating functional sequences and designing de novo structures, where the primary challenges are achieving biological stability against proteolysis, ensuring proper folding, and minimizing immunogenicity. Despite these distinct challenges, both domains face shared hurdles: the scarcity of high-quality experimental data, the reliance on inaccurate scoring functions for validation, and the crucial need for experimental validation. We conclude that the full potential of diffusion models will be unlocked by bridging these modality-specific gaps and integrating them into automated, closed-loop Design-Build-Test-Learn (DBTL) platforms, thereby shifting the paradigm from mere chemical exploration to the on-demand engineering of novel therapeutics.

Paper Structure

This paper contains 21 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A unified framework for de novo drug design using a conditional diffusion model. (a) The core engine is a conditional diffusion model, which comprises two processes. The noising process systematically corrupts a data structure, such as a protein ($X_0$), into Gaussian noise ($X_T$) over discrete timesteps. The generative process learns the reverse, creating novel structures by iteratively denoising from noise, guided by specific conditions. (b) For de novo small molecule design, the model generates molecular graphs or 3D coordinates conditioned on a target's binding pocket and desired properties (e.g., high activity, low toxicity) to produce diverse, pocket-fitting ligands. (c) For de novo therapeutic peptide design, the model generates peptide sequences and their corresponding 3D structures, conditioned on a target protein's surface, to design novel binders.
  • Figure 2: Contrasting Design Paradigms for Small Molecules and Therapeutic Peptides with Diffusion Models. The figure illustrates the distinct challenges and tailored AI-driven solutions for small molecules (left column, a,c,e,g) versus therapeutic peptides (right column, b,d,f,h). (a,b) The primary challenge for small molecules is navigating the vast, discrete chemical space, whereas for peptides, it is conquering the continuous conformational space to achieve a stable fold. (c,d) Consequently, diffusion models are employed for structure-based generation to fit small molecules into binding pockets, while for peptides, they perform structure-guided design by decorating a predefined scaffold. (e,f) Key downstream hurdles also differ: ensuring chemical synthesizability for small molecules versus achieving biological stability against degradation for peptides. (g,h) Finally, solutions are modality-specific: integrating chemical knowledge (e.g., reaction rules) to guide synthesis for small molecules, and engineering stability in peptides through modifications like cyclization or using non-canonical amino acids. Explanation of symbols: The red crosses (X) indicate synthetic infeasibility (e) or blocked enzymatic degradation (f, h). In (e), the colored spheres represent atoms within a complex molecular graph structure.
  • Figure 3: A Closed-Loop Paradigm for Drug Discovery Driven by AI and Automation. The figure depicts an autonomous Design-Build-Test-Learn (DBTL) cycle, representing a future paradigm for accelerated therapeutic discovery. This approach seamlessly integrates AI-powered design with automated laboratory execution to create a self-optimizing discovery engine. (a) Design: Generative AI models propose novel molecular candidates in silico. (b) Build: The most promising candidates are synthesized and purified using robotic platforms. (c) Test: The synthesized compounds are evaluated in high-throughput biological assays to generate activity data. (d) Learn: Experimental results are fed back into the AI model, which updates its knowledge and generates more informed hypotheses for the next cycle. This iterative process aims to dramatically shorten timelines and increase the success rate of finding novel medicines.
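
The noising process described in the Figure 1(a) caption, which corrupts a structure $X_0$ into Gaussian noise $X_T$ over discrete timesteps, can be illustrated with a minimal sketch. It assumes the common DDPM-style closed form $X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with a linear variance schedule; the caption does not specify a schedule, so that choice, the function names, and the toy three-atom coordinates are all hypothetical. Only the forward direction is shown, since the generative (reverse) process requires a trained denoising network.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (assumed linear schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: signal variance kept at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_coordinates(x0, t, alpha_bars, rng):
    """Sample X_t directly from X_0: sqrt(a) * x0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    a = alpha_bars[t]
    signal, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * c + sigma * rng.gauss(0.0, 1.0) for c in atom] for atom in x0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Toy 3-atom "structure" (hypothetical coordinates, arbitrary units).
x0 = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]]

x_early = noise_coordinates(x0, 0, alpha_bars, rng)    # nearly unchanged
x_late = noise_coordinates(x0, 999, alpha_bars, rng)   # close to pure Gaussian noise
```

At the first timestep almost all of the original signal survives, while at the last timestep `alpha_bars[-1]` is close to zero, so `x_late` carries essentially no information about `x0`. A conditional model of the kind the caption describes would be trained to invert these steps, given inputs such as a binding pocket or target surface.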