From thermodynamics to protein design: Diffusion models for biomolecule generation towards autonomous protein engineering

Wen-ran Li, Xavier F. Cadet, David Medina-Ortiz, Mehdi D. Davari, Ramanathan Sowdhamini, Cedric Damour, Yu Li, Alain Miranville, Frederic Cadet

TL;DR

This review surveys diffusion-model approaches for biomolecule generation, focusing on proteins, peptides, small molecules, and protein–ligand interactions. It delineates two principal diffusion formalisms, DDPM and Score-based Generative Models, and emphasizes $SE(3)$/$E(3)$ equivariance as essential for maintaining physical consistency in 3D biomolecular structures. It catalogs representative methods across sequence and structure generation (notably RFDiffusion, FrameDiff, Genie, Chroma, AlphaFold3 diffusion modules), peptide design (ProT-Diff, AMP-Diffusion, PepFlow), and molecular design (GeoDiff, SubGDiff, EDM-based methods, DiffDock), highlighting advances in geometry-aware denoisers and EGNN backbones. The discussion identifies remaining challenges in evaluation, synthesis feasibility, and dynamic behavior, and points to future directions including stronger physical constraints, broader generalization, and deeper integration with state-of-the-art structure prediction and design pipelines for autonomous protein engineering.

Abstract

Designing proteins with desirable properties has been a significant challenge for many decades. Generative artificial intelligence is a promising approach and has achieved great success in various protein generation tasks. Notably, diffusion models stand out for their robust mathematical foundations and impressive generative capabilities, offering unique advantages in certain applications such as protein design. In this review, we first give the definition and characteristics of diffusion models and then focus on two strategies: Denoising Diffusion Probabilistic Models (DDPM) and Score-based Generative Models (SGM), where DDPM is the discrete form of SGM. Furthermore, we discuss their applications in protein design, peptide generation, drug discovery, and protein-ligand interaction. Finally, we outline the future perspectives of diffusion models to advance autonomous protein design and engineering. The E(3) group consists of all rotations, reflections, and translations in three dimensions. Equivariance under the E(3) group helps keep the frame of each amino acid physically stable, and we reflect on how to keep the diffusion model E(3) equivariant for protein generation.
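
To make the E(3) equivariance requirement concrete, the following is a minimal sketch, not a method from the review. It assumes a hypothetical toy layer, `toy_update`, that moves every point toward the centroid, and checks that transforming the coordinates before the layer gives the same result as transforming them after it. A second hypothetical layer, `biased_update`, adds a fixed direction and therefore fails the same check under rotation. All function names here are illustrative assumptions.

```python
import math
import random


def centroid(points):
    """Mean position of a list of 3D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]


def toy_update(points, step=0.1):
    """Hypothetical equivariant layer: pull each point toward the centroid."""
    c = centroid(points)
    return [[p[k] + step * (c[k] - p[k]) for k in range(3)] for p in points]


def biased_update(points, step=0.1):
    """Hypothetical non-equivariant layer: push every point along the x-axis."""
    return [[p[0] + step, p[1], p[2]] for p in points]


def apply_e3(points, matrix, shift):
    """Apply x -> matrix @ x + shift, where matrix is orthogonal."""
    return [
        [sum(matrix[i][k] * p[k] for k in range(3)) + shift[i] for i in range(3)]
        for p in points
    ]


def rotation_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def equivariance_gap(layer, points, matrix, shift):
    """Largest coordinate difference between layer(g(x)) and g(layer(x))."""
    a = layer(apply_e3(points, matrix, shift))
    b = apply_e3(layer(points), matrix, shift)
    return max(abs(x - y) for p, q in zip(a, b) for x, y in zip(p, q))


rng = random.Random(0)
coords = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
shift = [0.5, -1.0, 2.0]

rot = rotation_z(0.7)
# Rotation followed by the mirror z -> -z: orthogonal with determinant -1.
refl = [[row[0], row[1], -row[2]] for row in rot]

gap_rot = equivariance_gap(toy_update, coords, rot, shift)
gap_refl = equivariance_gap(toy_update, coords, refl, shift)
gap_biased = equivariance_gap(biased_update, coords, rot, shift)

print(gap_rot, gap_refl, gap_biased)
```

The first two gaps are at floating-point rounding level, while the third is clearly nonzero, which is the distinction the equivariance requirement is meant to enforce.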

Paper Structure (43 sections, 3 theorems, 30 equations, 12 figures, 3 tables)

Key Result

Theorem 4.1

(Permutation equivariance of graph neural network) Consider consistent permutations of the shift operator $\hat{s}=P^{T}sP$ and input signal $\hat{x}=P^{T}x$. Then
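
The property named in the theorem title can be checked numerically. The following is a minimal sketch, assuming a hypothetical one-layer polynomial graph filter with a pointwise ReLU (not the paper's network): relabeling the nodes of both the shift operator and the input signal before the layer should give the same result as relabeling the layer's output. The permutation matrix $P$ is represented by an index list `perm`, so that $(P^{T}x)_i = x_{\mathrm{perm}[i]}$ and $(P^{T}sP)_{ij} = s_{\mathrm{perm}[i],\mathrm{perm}[j]}$; the filter taps are arbitrary illustrative values.

```python
import random


def matvec(S, x):
    """Dense matrix-vector product S @ x."""
    n = len(x)
    return [sum(S[i][j] * x[j] for j in range(n)) for i in range(n)]


def graph_filter_layer(S, x, taps=(0.5, -0.3, 0.2)):
    """Hypothetical layer: relu(sum_k taps[k] * S^k x)."""
    out = [0.0] * len(x)
    power = list(x)  # S^0 x
    for h in taps:
        out = [o + h * p for o, p in zip(out, power)]
        power = matvec(S, power)  # advance to the next power of S applied to x
    return [max(0.0, v) for v in out]


def permute_signal(x, perm):
    """Compute P^T x for the permutation encoded by perm."""
    return [x[perm[i]] for i in range(len(x))]


def permute_operator(S, perm):
    """Compute P^T S P for the permutation encoded by perm."""
    n = len(S)
    return [[S[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


rng = random.Random(1)
n = 6

# Random symmetric 0/1 adjacency matrix used as the shift operator.
S = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w = 1.0 if rng.random() < 0.5 else 0.0
        S[i][j] = w
        S[j][i] = w

x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
perm = list(range(n))
rng.shuffle(perm)

# Permute the inputs first, then apply the layer ...
lhs = graph_filter_layer(permute_operator(S, perm), permute_signal(x, perm))
# ... versus apply the layer first, then permute the output.
rhs = permute_signal(graph_filter_layer(S, x), perm)

perm_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(perm_gap)
```

The gap stays at rounding level because $(P^{T}sP)^{k}P^{T}x = P^{T}s^{k}x$ for every power $k$, and a pointwise nonlinearity commutes with relabeling.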

Figures (12)

  • Figure 1: Visualization of a diffusion model applied to image generation. During the forward diffusion process, the image is progressively noised until it follows a Gaussian distribution. The reverse process is a denoising process in which the image gradually becomes clear.
  • Figure 2: $SE(3)$ equivariant diffusion models for protein structure generation. RFDiffusion, FrameDiff and Genie use RoseTTAFold, IPA and an $SE(3)$-equivariant denoiser, respectively, as the single step of the denoising process in the diffusion model. Boxes in pink are $SE(3)$ equivariant blocks. $SE(3)$ equivariance keeps the frames of each amino acid physically stable.
  • Figure 3: Timeline of major advancements in protein design methods from March 2022 to May 2024. Each event marks the introduction of a significant model or method, categorized by its underlying computational framework. The models are color-coded based on their primary components: Red represents EGNN-based methods, orange corresponds to RoseTTAFold-inspired methods, blue highlights IPA-based methods, and cyan denotes ESM-based methods.
  • Figure 4: Overview of EDM (Equivariant Diffusion Models) and its extensions for molecular generation tasks. The top box represents the foundational EDM model, which uses a 3D point cloud representation with E(3) equivariance to handle molecular structures. The figure highlights the key limitations of earlier models (shown in blue boxes) and shows how subsequent models address them. Irregular Training Space: GeoLDM uses latent space encoding but performs poorly in generating realistic molecules. SubDiff solves this issue by introducing a subgraph extraction process to improve generation quality. Scalability to Complex Molecules: MDM considers covalent bonds and van der Waals forces but cannot adapt to target-specific molecular pockets. PMDM incorporates a dual equivariant encoder and Gaussian noise to handle complex protein-ligand interactions. Limited Modality: MiDi combines 2D connectivity graphs and 3D point clouds but adapts poorly to the data distribution. EQGAT-Diff enhances performance by introducing an EQGAT encoder for better data alignment. Unrealistic Molecules: MolDiff generates molecules with inaccurate ligand interactions. MolSnapper improves molecular realism by accurately representing ligand interactions within target pockets.
  • Figure 5: Mindmap of the 56 models featured in this review: the models boxed with a continuous blue line are the $SE(3)$ equivariant models, the models in gray boxes (such as PepFlow and PepGLAD) are the $E(3)$ equivariant models, and the blue-shaded ones are models based on AlphaFold2. A dark blue branch line indicates membership in a model class, and a light blue branch line indicates that the later model is based on the former.
  • ...and 7 more figures

Theorems & Definitions (12)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.1
  • Theorem 4.1
  • Definition 4.2
  • Definition 4.3
  • Lemma 4.4
  • Theorem 4.5
  • Definition 5.1
  • Definition 5.2
  • ...and 2 more