Generating Highly Designable Proteins with Geometric Algebra Flow Matching

Simon Wagner, Leif Seute, Vsevolod Viliuga, Nicolas Wolf, Frauke Gräter, Jan Stühmer

TL;DR

Clifford Frame Attention (CFA), an extension of the invariant point attention (IPA) architecture from AlphaFold2, is proposed, in which the backbone residue frames and geometric features are represented in the projective geometric algebra.

Abstract

We introduce a generative model for protein backbone design that uses geometric products and higher-order message passing. In particular, we propose Clifford Frame Attention (CFA), an extension of the invariant point attention (IPA) architecture from AlphaFold2, in which the backbone residue frames and geometric features are represented in the projective geometric algebra. This makes it possible to construct geometrically expressive messages between residues, including higher-order terms, using the bilinear operations of the algebra. We evaluate our architecture by incorporating it into the framework of FrameFlow, a state-of-the-art flow matching model for protein backbone generation. The proposed model achieves high designability, diversity and novelty, while also sampling protein backbones that follow the statistical distribution of secondary structure elements found in naturally occurring proteins, a property that many state-of-the-art generative models have so far achieved only insufficiently.

Paper Structure

This paper contains 51 sections, 3 theorems, 41 equations, 14 figures, 9 tables, 6 algorithms.

Key Result

theorem A.2

(Cartan-Dieudonné theorem) Let $(V, q)$ be a nondegenerate space of dimension $n$. Then any orthogonal transformation of $(V, q)$ can be written as a composition of at most $n$ reflections.
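
As a small illustration of the statement, the sketch below checks numerically that a rotation in ordinary Euclidean 3-space is the composition of two reflections. It assumes the simplest case (positive-definite $q$, mirror planes through the origin) and is not taken from the paper's implementation; the helper names `reflect`, `rotate_z` and `rotation_from_two_reflections` are hypothetical.

```python
import math

def reflect(v, n):
    """Reflect vector v across the plane through the origin with unit normal n."""
    d = sum(vi * ni for vi, ni in zip(v, n))
    return tuple(vi - 2.0 * d * ni for vi, ni in zip(v, n))

def rotate_z(v, theta):
    """Reference: rotate v counter-clockwise by theta about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def rotation_from_two_reflections(v, theta):
    """Rotate v by theta about the z-axis using only two reflections.

    The two mirror planes contain the z-axis and their normals enclose
    an angle of theta / 2, so the composition rotates by theta.
    """
    n1 = (1.0, 0.0, 0.0)
    n2 = (math.cos(theta / 2.0), math.sin(theta / 2.0), 0.0)
    return reflect(reflect(v, n1), n2)

if __name__ == "__main__":
    v = (0.3, -1.2, 0.7)
    theta = 1.1
    print(rotation_from_two_reflections(v, theta))
    print(rotate_z(v, theta))  # should agree up to rounding error
```

A single reflection flips orientation, so an even number of reflections gives a rotation; in three dimensions the theorem guarantees that at most three reflections are ever needed.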

Figures (14)

  • Figure 1: (A) Protein backbone residue with three backbone atoms represented by a coordinate frame. (B) In PGA, a frame can be represented via the geometric product of four planes. Two of the planes parameterize the frame's rotation around their line of intersection, while the other two encode the frame's translation along the separation vector between them. (C) An exemplary protein backbone structure containing an $\alpha$-helix and a $\beta$-sheet. Lines (red), planes (violet) and Euclidean frames (blue) can all be embedded as elements of PGA, facilitating a geometric inductive bias for learning representations of the abstract geometry of the protein.
  • Figure 2: Representative examples of designable protein backbones generated with GAFL (white) and the output of the refolding pipeline (colored), comprising ProteinMPNN and ESMFold.
  • Figure 3: (A) Performance of evaluated models in terms of designability and secondary structure content as a function of backbone length. 200 backbones were generated for each model at each length $\in\{60,80,100,150,200,250,300\}$. (B) Comparison of the secondary structure distributions of backbones generated by GAFL and RFdiffusion from (A) to the PDB dataset filtered by the respective protein lengths along with the Wasserstein distance (WD) between the distributions. (C) Examples of designable backbones generated by GAFL for lengths 450 and 500. We also report TM scores of the backbones to the closest hit in the PDB database computed with FoldSeek.
  • Figure 4: Helix content and designabilities of 90 model checkpoints sampled during three training runs on the PDB dataset for GAFL and retrained FrameFlow, respectively. For each checkpoint, we sample 40 backbones per length in $\{100,150,\ldots,300\}$.
  • Figure A.1: Visualization of the different geometric primitives in $\mathbb{G}_{3}$. Vectors are directed line segments, bivectors are oriented areas and trivectors are oriented volumes. Orientations are indicated by arrows that have the same sense of rotation as the corresponding basis vectors when linked together end to tip.
  • ...and 9 more figures

Theorems & Definitions (3)

  • theorem A.2
  • theorem A.3
  • theorem A.4