PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design
Zhenqiao Song, Tiaoxiao Li, Lei Li, Martin Renqiang Min
TL;DR
PPDiff addresses the challenge of designing high-affinity protein binders for arbitrary targets without extensive wet-lab screening. It introduces a diffusion-based framework that jointly generates binder sequences and backbones conditioned on a target, built on the SSINC architecture and trained on the large PPBench dataset. The pretrained model achieves 50.00% top-1 success on general protein-protein complex design and improves performance on two real-world applications after fine-tuning. The work demonstrates a scalable approach to on-demand binder design with high novelty and diversity, though experimental validation is left to future work.
Abstract
Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiff builds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, k-nearest neighbor (kNN) equivariant graph layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiff consistently surpasses baseline methods, achieving success rates of 50.00%, 23.16%, and 16.89% for the pretraining task and the two downstream applications, respectively. The code, data and models are available at https://github.com/JocelynSong/PPDiff.
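
To make two of the ingredients named in the abstract concrete, the following minimal sketch shows how a kNN neighbor graph over residue coordinates and a causal attention mask could be built. It is not taken from the PPDiff code base: the function names `knn_edges` and `causal_mask`, the use of one 3D coordinate per residue, and the plain Euclidean distance are all illustrative assumptions.

```python
import numpy as np


def knn_edges(coords: np.ndarray, k: int) -> np.ndarray:
    """Return an (N, k) array holding, for each residue, the indices of its
    k nearest neighbors (hypothetical helper, not the authors' implementation).

    coords: (N, 3) array, assumed to hold one 3D coordinate per residue.
    """
    n = coords.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of residues")
    # Pairwise Euclidean distances between all residues, shape (N, N).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # A residue is not its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Sort each row by distance and keep the k closest indices.
    return np.argsort(dist, axis=1)[:, :k]


def causal_mask(n: int) -> np.ndarray:
    """Boolean (n, n) mask where entry (i, j) is True if position i may
    attend to position j, i.e. j <= i (hypothetical helper)."""
    return np.tril(np.ones((n, n), dtype=bool))


if __name__ == "__main__":
    # Four residues placed on a line, purely for illustration.
    coords = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [3.0, 0.0, 0.0],
                       [7.0, 0.0, 0.0]])
    print(knn_edges(coords, k=2))
    print(causal_mask(3))
```

In a network of this kind, the neighbor indices would restrict message passing in the graph layers to spatially close residues, while the mask would restrict each sequence position to attend only to itself and earlier positions. How SSINC actually combines these components is specified in the paper and the released code, not in this sketch.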
