PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design

Zhenqiao Song, Tiaoxiao Li, Lei Li, Martin Renqiang Min

TL;DR

PPDiff addresses the challenge of designing high-affinity protein binders for arbitrary targets without extensive wet-lab screening. It introduces a diffusion-based framework that jointly generates binder sequences and backbones conditioned on a target, built on the SSINC architecture and trained on PPBench, a dataset of 706,360 protein-protein complexes. The pretrained model achieves 50.00% top-1 success on general protein-protein complex design and, after fine-tuning, outperforms baselines on two real-world applications: target-protein mini-binder complex design (23.16% success) and antigen-antibody complex design (16.89% success). The work demonstrates a scalable approach to on-demand binder design with high novelty and diversity; experimental validation is left to future work.

Abstract

Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiff builds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, k-nearest neighbor (kNN) equivariant graph layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiff consistently surpasses baseline methods, achieving success rates of 50.00%, 23.16%, and 16.89% for the pretraining task and the two downstream applications, respectively. The code, data and models are available at https://github.com/JocelynSong/PPDiff.
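
To make the "kNN graph layers" idea concrete, the sketch below shows one plausible way to build the k-nearest-neighbor edge list over residue coordinates that such a layer would pass messages along. It is an illustration only, not the authors' implementation: the function name `knn_edges`, the use of one 3D point per residue, and the toy coordinates are all assumptions, and the equivariant message-passing step itself is not shown.

```python
import math

def knn_edges(coords, k):
    """Return directed edges (i, j) linking each residue i to its k nearest residues j.

    `coords` is a list of (x, y, z) tuples, one per residue (hypothetical input format).
    """
    edges = []
    for i, ci in enumerate(coords):
        # Euclidean distance from residue i to every other residue (self excluded).
        dists = [(math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i]
        dists.sort()
        # Keep only the k closest neighbors as local-interaction edges.
        edges.extend((i, j) for _, j in dists[:k])
    return edges

# Toy coordinates in angstroms for four residues; the values are made up.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
edges = knn_edges(coords, k=2)
```

Because the edges depend only on pairwise distances, the graph is unchanged when the whole complex is rotated or translated, which is the property an equivariant layer relies on. Restricting each residue to k neighbors keeps the edge count linear in sequence length, while the global correlations are left to the self-attention layers described in the abstract.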

Paper Structure

This paper contains 38 sections, 16 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: (a) Overall architecture of our proposed PPDiff. (b) We first pretrain PPDiff on PPBench, a general protein-protein complex dataset curated from PDB. Then we can finetune the pretrained model on important real-world protein-protein complex design applications, such as target-protein mini-binder complex design and antigen-antibody complex design.
  • Figure 2: Ablation study on: (a) causal attention layer size, (b) different diffusion steps, (c-d) additional Swiss-Prot data pretraining, (e) different model scales, (f-h) different total candidate sizes.
  • Figure 3: Designed protein complexes from: (a-c) general protein-protein complex design, (d-f) target-protein mini-binder complex design, and (g-i) antigen-antibody complex design. Target proteins are shown in blue, and the designed binder proteins in green, with light chains in pink for antigen-antibody complexes. PPDiff is able to design high-affinity protein-binding proteins across diverse target scaffolds.
  • Figure 4: Designed complexes for general protein-protein complex design by our PPDiff. All of them achieve an ipTM score approaching or higher than 0.8, pTM score above 0.8, PAE lower than 10 and pLDDT better than 80. These designed binder sequences also have novelty scores higher than 80%, validating that PPDiff is capable of designing novel and high-affinity protein-binding proteins across diverse protein targets.
  • Figure 5: Designed complexes by our PPDiff for target-protein mini-binder complex design. All of them achieve an ipTM score higher than 0.7, pTM score above 0.7, PAE lower than 10 and pLDDT better than 80. These designed binder sequences also have novelty scores higher than 80%, validating that PPDiff is capable of designing novel and high-affinity binder proteins across diverse protein targets.
  • ...and 1 more figure