Proteina: Scaling Flow-based Protein Structure Generative Models

Tomas Geffner, Kieran Didi, Zuobai Zhang, Danny Reidenbach, Zhonglin Cao, Jason Yim, Mario Geiger, Christian Dallago, Emine Kucukbenli, Arash Vahdat, Karsten Kreis

TL;DR

Proteína addresses scalable, controllable de novo protein backbone design with a large-scale flow-based model that is conditioned on hierarchical fold classes and built on a non-equivariant transformer. It reports state-of-the-art performance on unconditional and fold-class-conditioned backbone generation, generates backbones of up to 800 residues, and trains on up to 21M synthetic structures. The authors also introduce distribution-level metrics (FPSD, fS, fJSD), adapt guidance methods (classifier-free guidance and autoguidance) to protein backbones, and apply LoRA fine-tuning. Together, these contributions give control over generation at both the secondary-structure and the fold level, and they make it practical to design long, diverse, and designable backbones. The work provides a basis for large-scale protein structure generation and for further study of guided, efficient sampling and data-driven design workflows.

Abstract

Recently, diffusion- and flow-based generative models of protein structures have emerged as a powerful tool for de novo protein design. Here, we develop Proteina, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to 5x as many parameters as previous models. To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further explore scaling training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine-tuning strategies like LoRA for protein backbones, new guidance methods like classifier-free guidance and autoguidance for protein backbones, and new adjusted training objectives. Proteina achieves state-of-the-art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues. The hierarchical conditioning offers novel control, enabling high-level secondary-structure guidance as well as low-level fold-specific generation.
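To make the flow-based sampling and classifier-free guidance mentioned in the abstract more concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes a linear interpolant with noise at t = 0 and data at t = 1, a guided velocity of the form w * v_cond + (1 - w) * v_uncond, and plain Euler integration. The function `toy_velocity` is a hypothetical stand-in for the trained network: it is the exact velocity field for a data distribution that consists of a single point. Coordinates are treated as a flat list of numbers (for example, flattened $C_\alpha$ positions).

```python
import random


def sample_noise(n_coords, rng):
    """Draw a Gaussian starting point x_0 (assumed convention: t = 0 is noise)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_coords)]


def interpolate(x0, x1, t):
    """Linear interpolant x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the linear interpolant: d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a trained network: the exact velocity field
    when the data distribution is the single point `target`."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


def guided_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance (assumed form): w = 1 is purely conditional,
    w > 1 extrapolates away from the unconditional prediction."""
    return [w * c + (1.0 - w) * u for c, u in zip(v_cond, v_uncond)]


def sample(x0, cond_target, uncond_target, w, n_steps=50):
    """Euler integration of dx/dt = v(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never hits zero
        v = guided_velocity(
            toy_velocity(x, t, cond_target),
            toy_velocity(x, t, uncond_target),
            w,
        )
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this toy setting, calling `sample(sample_noise(3, random.Random(0)), cond, uncond, w=1.0)` ends at the conditional target `cond`, while w = 2 ends at the extrapolated point 2 * cond - uncond. This illustrates how a guidance weight above 1 pushes samples further toward the conditioning signal.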

Paper Structure

This paper contains 66 sections, 35 equations, 25 figures, 16 tables, 1 algorithm.

Figures (25)

  • Figure 1: Proteína. We use flow-matching and learn a flow to transform a Gaussian distribution over initial protein backbone coordinates (residues' $C_\alpha$ atoms) into realistic protein structures. Proteína relies on a scalable transformer-based architecture and can be conditioned on hierarchical fold class labels for improved controllability and complex protein structure design tasks.
  • Figure 2: Proteína Samples. Designable backbones generated unconditionally by the $\mathcal{M}_{\textrm{FS}}$ model (${<}$250 residues).
  • Figure 3: Dataset Statistics. (a) Dataset size comparisons. (b) Sunburst plot of the hierarchical fold class labels in our largest dataset $\mathcal{D}_{\textrm{21M}}$, depicting the hierarchical label structure and the relative sizes of the three hierarchical fold class categories C, A, and T.
  • Figure 4: Long Proteína Samples. Chain lengths in (a)-(g): [300, 400, 500, 600, 700, 800, 800]. (a) "Mixed $\alpha/\beta$"-guided. (b) "Mainly $\beta$"-guided. (e) "Mixed $\alpha/\beta$"-guided. Others unconditional. All samples designable.
  • Figure 5: Proteína's transformer architecture. (a)-(c) We first create a sequence representation, sequence conditioning features, and a pair representation. (d) They are processed by conditioned and biased (through the pair representation) multi-head attention layers, described in (e). We use a variant of QK normalization, applying LayerNorm (LN) to the Q and K inputs to the attention operation, before the multi-head split. Optionally, the pair representation can be updated. See the network components appendix for the Pair Update, Adaptive LN, and Adaptive Scale modules.
  • ...and 20 more figures
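
The attention variant described in the Figure 5 caption can be sketched in a few lines of plain Python. This is an illustrative reading of the caption, not the authors' code. It assumes that Q, K, and V are already projected, that LayerNorm has no learnable gain or bias, and that the pair representation has already been reduced to one additive bias per head and residue pair (the hypothetical `pair_bias` argument). Adaptive LN, Adaptive Scale, the output projection, and the optional Pair Update are omitted.

```python
import math


def layer_norm(v, eps=1e-5):
    """LayerNorm over one feature vector (no learnable gain or bias)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def softmax(v):
    """Numerically stable softmax over a list of logits."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def split_heads(v, n_heads):
    """Split a d_model-sized vector into n_heads contiguous chunks."""
    d = len(v) // n_heads
    return [v[h * d:(h + 1) * d] for h in range(n_heads)]


def pair_biased_attention(q, k, v, pair_bias, n_heads):
    """q, k, v: [n_res][d_model]; pair_bias: [n_heads][n_res][n_res].

    LayerNorm is applied to Q and K over the full feature dimension,
    before the multi-head split, as the caption describes.
    """
    n_res = len(q)
    d_head = len(q[0]) // n_heads
    qh = [split_heads(layer_norm(row), n_heads) for row in q]
    kh = [split_heads(layer_norm(row), n_heads) for row in k]
    vh = [split_heads(row, n_heads) for row in v]
    out = []
    for i in range(n_res):
        row_out = []
        for h in range(n_heads):
            # Scaled dot-product logits plus the per-head pair bias.
            logits = [
                sum(a * b for a, b in zip(qh[i][h], kh[j][h])) / math.sqrt(d_head)
                + pair_bias[h][i][j]
                for j in range(n_res)
            ]
            w = softmax(logits)
            head = [
                sum(w[j] * vh[j][h][c] for j in range(n_res))
                for c in range(d_head)
            ]
            row_out.extend(head)  # concatenate heads back to d_model
        out.append(row_out)
    return out
```

Normalizing Q and K bounds the magnitude of the dot-product logits, which is one common motivation for QK normalization in large transformers. The additive bias lets pairwise features steer which residues attend to each other: for instance, a very negative bias on all off-diagonal pairs makes each residue attend only to itself.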