Out of Many, One: Designing and Scaffolding Proteins at the Scale of the Structural Universe with Genie 2

Yeqing Lin, Minji Lee, Zhao Zhang, Mohammed AlQuraishi

TL;DR

This work introduces Genie 2, a diffusion-based framework for structure- and motif-aware protein design at scale. By combining motif-conditioned, SE(3)-aware diffusion with a multi-motif scaffolding framework and large-scale AFDB augmentation, Genie 2 achieves state-of-the-art designability, diversity, and novelty in unconditional generation and demonstrates substantial capabilities for single- and multi-motif scaffolding. The approach enables the design of proteins with multiple independent functional motifs, thereby expanding the design space for enzymes, biosensors, and therapeutics. This performance comes at the cost of slower sampling and higher computational complexity, pointing to future work on faster inference and scaling to even larger proteins. Code and model weights are publicly released to support structure-based protein design research and applications.

Abstract

Protein diffusion models have emerged as a promising approach for protein design. One such pioneering model is Genie, a method that asymmetrically represents protein structures during the forward and backward processes, using simple Gaussian noising for the former and expressive SE(3)-equivariant attention for the latter. In this work we introduce Genie 2, extending Genie to capture a larger and more diverse protein structure space through architectural innovations and massive data augmentation. Genie 2 adds motif scaffolding capabilities via a novel multi-motif framework that designs co-occurring motifs with unspecified inter-motif positions and orientations. This makes possible complex protein designs that engage multiple interaction partners and perform multiple functions. On both unconditional and conditional generation, Genie 2 achieves state-of-the-art performance, outperforming all known methods on key design metrics including designability, diversity, and novelty. Genie 2 also solves more motif scaffolding problems than other methods and does so with more unique and varied solutions. Taken together, these advances set a new standard for structure-based protein design. Genie 2 inference and training code, as well as model weights, are freely available at: https://github.com/aqlaboratory/genie2.
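
The "simple Gaussian noising" of the forward process can be illustrated with a standard DDPM-style sketch. This is not taken from the Genie 2 codebase: the linear variance schedule, the 1,000-step count, the helper names (`linear_betas`, `alpha_bars`, `noise_coords`), and the representation of a structure as a plain list of 3D coordinates are all assumptions made for illustration.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear variance schedule; the paper's schedule may differ."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per diffusion step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def noise_coords(coords, t, abar, rng):
    """Sample x_t ~ N(sqrt(abar_t) * x_0, (1 - abar_t) * I) for a list of (x, y, z)."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [
        tuple(scale * c + sigma * rng.gauss(0.0, 1.0) for c in atom)
        for atom in coords
    ]

rng = random.Random(0)                  # fixed seed for reproducibility
abar = alpha_bars(linear_betas(1000))   # assumed 1,000 diffusion steps
x0 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]  # toy 3-residue trace
x_mid = noise_coords(x0, 500, abar, rng)
```

Because the noised sample at any step has a closed form, training examples can be drawn at a random step without simulating the whole chain; the learned reverse process is where the SE(3)-equivariant machinery described in the abstract comes in.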

Paper Structure

This paper contains 40 sections, 6 equations, 13 figures, 7 tables, 1 algorithm.

Figures (13)

  • Figure 1: Genie 2 architecture (top), which extends Genie to enable scaffolding on (multiple) motifs. It consists of an SE(3)-invariant encoder that transforms input features into single residue and pair residue-residue representations, and an SE(3)-equivariant decoder that updates frames based on single representations, pair representations, and input reference frames. Example inputs to the model for single- and multi-motif scaffolding problems are shown (bottom-left green box), along with the corresponding generated designs (bottom-right box). In single motif scaffolding (top row), the motif may be contiguous or non-contiguous but all inter-residue positions and orientations are defined. In multi-motif scaffolding (bottom row), inter-motif geometry is left unspecified. For input sequences, white boxes denote masked out regions corresponding to the scaffold.
  • Figure 2: Visualizations of in-distribution performance on unconditional generation. (A) Secondary structure distributions of proteins generated by Chroma, RFDiffusion and Genie 2. For reference, we also include the secondary structure distribution of 1,000 structures randomly drawn from AFDB (far right). (B) Secondary structure distributions of proteins generated by Genie 2 when sampling noise scale ($\gamma$ in equation (3)) is set to 0 (left) and 1 (right). (C) Self-consistency results on 1,000 randomly chosen structures from the PDB and clustered AFDB datasets.
  • Figure 3: Assessment of methods by sequence length. For each method/sequence length combination, we generate 100 structures. (A) Box-and-whisker plots of scRMSDs between generated structures and their most similar ESMFold-predicted structures. Asterisks (*) indicate that sequence lengths exceed the maximum seen during training. (B-C) Plots of designability (B) and diversity (C) as a function of sequence length. (D) Example structures generated by Genie 2.
  • Figure 4: Comparison of Genie 2 and RFDiffusion on single-motif scaffolding. (A) Performance of Genie 2 and RFDiffusion across 24 single-motif scaffolding tasks. Inset (top right) shows a scatter plot of the (unique) success rate of Genie 2 vs. RFDiffusion; each point represents a scaffolding task. Summary statistics are shown in table (left). Example designs are shown (bottom) for successful task 3IXT (green) as well as failed task 4JHW (red). Scaffolds (white), motifs (blue), and unsatisfied sought motifs (red) are overlaid. (B) Plot of number of unique successes as a function of sample size.
  • Figure 5: Performance of Genie 2 on multi-motif scaffolding tasks. (A) Successful designs for task 1PRW_four (scaffolding with four $\text{Ca}^{2+}$ ion binding sites) and 4JHW+5WN9 (scaffolding with RSV-F site II epitope and RSV-G 2D10 epitope). Scaffolds are in grey and distinct motifs are colored differently. (B) (Top) Successful design for multi-epitope immunogen. (Bottom) Individual epitope designs superposed over target structures (red).
  • ...and 8 more figures
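
The designability curves referred to in the Figure 3 caption are built from per-structure scRMSD values. As a rough sketch of how such a fraction could be computed: the 2.0 Å cutoff, the use of the best (minimum) scRMSD across several designed sequences per structure, and the function name `designability` are assumptions for illustration and are not stated in the captions above.

```python
def designability(sc_rmsds_per_structure, threshold=2.0):
    """Fraction of structures whose best scRMSD falls below `threshold` (in Å).

    `sc_rmsds_per_structure` holds, for each generated structure, the scRMSDs
    of its refolded candidate sequences. The threshold is an assumed cutoff.
    """
    best = [min(vals) for vals in sc_rmsds_per_structure]
    hits = sum(1 for b in best if b < threshold)
    return hits / len(best)

# Toy values: four generated structures, two candidate sequences each.
example = [[1.2, 3.4], [2.5, 2.1], [0.9, 1.1], [4.0, 5.2]]
frac = designability(example)
```

In this toy input, two of the four structures have a best scRMSD below the assumed cutoff, so `frac` is 0.5.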