Secondary Structure-Guided Novel Protein Sequence Generation with Latent Graph Diffusion

Yutong Hu, Yang Tan, Andi Han, Lirong Zheng, Liang Hong, Bingxin Zhou

TL;DR

This work tackles de novo protein sequence design under coarse secondary-structure constraints. It introduces CPDiffusion-SS, a latent graph diffusion framework with an encoder–decoder VAE and an EGNN-based latent diffusion module that operates on SS-level representations conditioned by SS graphs. The model is trained in two stages on large-scale protein data (AlphaFoldDB and CATH4.3) and demonstrates superior diversity, novelty, and SS-consistency compared with baselines across open benchmarks, with case studies illustrating practical design potential. By integrating structure-aware diffusion with autoregressive decoding, CPDiffusion-SS provides a scalable, flexible pathway for generating protein sequences that honor predefined secondary-structure patterns, informing future design and biotech applications.

Abstract

The advent of deep learning has introduced efficient approaches for de novo protein sequence design, significantly improving success rates and reducing development costs compared to computational or experimental methods. However, existing methods face challenges in generating proteins with diverse lengths and shapes while maintaining key structural features. To address these challenges, we introduce CPDiffusion-SS, a latent graph diffusion model that generates protein sequences based on coarse-grained secondary structural information. CPDiffusion-SS offers greater flexibility in producing a variety of novel amino acid sequences while preserving overall structural constraints, thus enhancing the reliability and diversity of generated proteins. Experimental analyses demonstrate the significant superiority of the proposed method in producing diverse and novel sequences, with CPDiffusion-SS surpassing popular baseline methods on open benchmarks across various quantitative measurements. Furthermore, we provide a series of case studies to highlight the biological significance of the sequences generated by the proposed method. The source code is publicly available at https://github.com/riacd/CPDiffusion-SS
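
To make "coarse-grained secondary structural information" concrete, the following is a minimal sketch, not the authors' implementation, of how a per-residue secondary-structure string could be collapsed into SS-level segments and how per-residue features could be pooled into one vector per segment. The three-state alphabet (H, E, C) follows the Figure 4 caption; the function names, the `(label, start, length)` tuple layout, and the choice of mean pooling are assumptions made for illustration only.

```python
from itertools import groupby


def segment_secondary_structure(ss):
    """Collapse a per-residue SS string into (label, start, length) segments.

    Hypothetical helper: each maximal run of identical labels becomes one
    coarse-grained segment, i.e. one candidate node of an SS-level graph.
    """
    segments = []
    start = 0
    for label, run in groupby(ss):
        length = len(list(run))
        segments.append((label, start, length))
        start += length
    return segments


def pool_by_segment(residue_features, segments):
    """Mean-pool per-residue feature vectors into one vector per segment.

    Mean pooling is an assumption here; other pooling layers are possible.
    """
    pooled = []
    for _, start, length in segments:
        rows = residue_features[start:start + length]
        pooled.append([sum(col) / length for col in zip(*rows)])
    return pooled


# Toy example: 12 residues, one scalar feature per residue.
ss = "CCHHHHCCEEEC"
features = [[float(i)] for i in range(len(ss))]

segments = segment_secondary_structure(ss)
pooled = pool_by_segment(features, segments)
```

In this toy case the 12 residues reduce to five segments, so anything operating at the SS level sees a much shorter, variable-length sequence of nodes than the residue-level input.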

Paper Structure (25 sections, 7 equations, 4 figures, 2 tables)

This paper contains 25 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The illustrative figure of CPDiffusion-SS. The model embeds AA sequences into a hidden space of secondary structures using the latent graph diffusion model. The generated latent secondary structure representation is then translated into AA sequences of variable lengths by an autoregressive decoder.
  • Figure 2: Illustrative architecture of the latent diffusion model.
  • Figure 3: Learning curves with different (a) pooling layers; (b) learning rates and dropout rates; (c) noise schedules in the diffusion model.
  • Figure 4: Predicted 3D structures and composition of secondary structures on three cases from the test dataset. Here we use red, yellow, and blue colors to represent helices (H), sheets (E), and coils (C), respectively.