Rigidity-Aware Geometric Pretraining for Protein Design and Conformational Ensembles

Zhanghan Ni, Yanjing Li, Zeju Qiu, Bernhard Schölkopf, Hongyu Guo, Weiyang Liu, Shengchao Liu

TL;DR

RigidSSL is a geometric pretraining framework that front-loads geometry learning before generative finetuning. It improves designability by up to 43% while enhancing novelty and diversity in unconditional generation.

Abstract

Generative models have recently advanced $\textit{de novo}$ protein design by learning the statistical regularities of natural structures. However, current approaches face three key limitations: (1) existing methods cannot jointly learn protein geometry and design tasks, a gap that pretraining can address; (2) current pretraining methods mostly rely on local, non-rigid atomic representations geared toward downstream property prediction, which limits global geometric understanding for protein generation; and (3) existing approaches have yet to effectively model the rich dynamic and conformational information of protein structures. To overcome these issues, we introduce $\textbf{RigidSSL}$ ($\textit{Rigidity-Aware Self-Supervised Learning}$), a geometric pretraining framework that front-loads geometry learning prior to generative finetuning. Phase I (RigidSSL-Perturb) learns geometric priors from 432K structures from the AlphaFold Protein Structure Database with simulated perturbations. Phase II (RigidSSL-MD) refines these representations on 1.3K molecular dynamics trajectories to capture physically realistic transitions. Underpinning both phases is a bi-directional, rigidity-aware flow matching objective that jointly optimizes translational and rotational dynamics to maximize mutual information between conformations. Empirically, RigidSSL variants improve designability by up to 43% while enhancing novelty and diversity in unconditional generation. Furthermore, RigidSSL-Perturb improves the success rate by 5.8% in zero-shot motif scaffolding, and RigidSSL-MD captures more biophysically realistic conformational ensembles in G protein-coupled receptor modeling. The code is available at: https://github.com/ZhanghanNi/RigidSSL.git.

Paper Structure

This paper contains 44 sections, 34 equations, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of RigidSSL. (a) View construction in RigidSSL-Perturb: translational noise in $\mathbb{R}^3$ and rotational noise in $\operatorname{SO}(3)$ are applied to generate perturbations in the rigid body motion group $\mathrm{SE}(3)$. (b) View construction in RigidSSL-MD: perturbed states are obtained by sampling conformational frames from MD trajectories. (c) Rigidity-based pretraining in RigidSSL: proteins are canonicalized into a reference frame, intermediate states are constructed via interpolation of translations and rotations for each rigid residue frame, and bi-directional flow matching is applied for pretraining. Details can be found in the Methods section.
  • Figure 2: Distribution of secondary structure elements ($\alpha$-helices, $\beta$-sheets, and coils) in the protein structure database (a-b) and in designable proteins (scRMSD $\leq$ 2.0 Å) generated by FoldFlow-2 under different pretraining methods (c-h). Plots of the structure database are color-coded by sequence length, whereas those of the generated structures are color-coded by scRMSD.
  • Figure 3: FoldFlow-2 generated structures (orange) compared against ProteinMPNN $\rightarrow$ ESMFold refolded structures (grey). Columns denote pretraining methods, and rows denote sequence lengths of 700 and 800.
  • Figure 4: Impact of translation and rotation noise scale on protein structure validity.
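
The view construction and interpolation steps described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`perturb_frames`, `interpolate_frames`), the noise scales `sigma_t` and `sigma_r`, and the choice of Gaussian noise in the tangent space of $\operatorname{SO}(3)$ are all assumptions made for illustration, and the paper may use a different rotational noise distribution. The sketch treats each residue as a rigid frame (a rotation matrix plus a translation), perturbs it, and then builds an intermediate state by linear interpolation of translations and geodesic interpolation of rotations.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def so3_exp(w):
    """Rodrigues' formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near the identity
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def so3_log(R):
    """Rotation matrix -> rotation vector (unstable for angles near pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:
        return 0.5 * v
    return theta / (2.0 * np.sin(theta)) * v

def perturb_frames(rots, trans, sigma_t, sigma_r, rng):
    """Hypothetical view construction: Gaussian noise on translations (R^3)
    and a small random rotation composed with each residue frame (SO(3))."""
    n = trans.shape[0]
    noisy_trans = trans + sigma_t * rng.standard_normal((n, 3))
    noisy_rots = np.stack(
        [so3_exp(sigma_r * rng.standard_normal(3)) @ R for R in rots]
    )
    return noisy_rots, noisy_trans

def interpolate_frames(rots0, trans0, rots1, trans1, t):
    """Intermediate state at time t in [0, 1]: linear in translation,
    geodesic in rotation, applied independently to each residue frame."""
    trans_t = (1.0 - t) * trans0 + t * trans1
    rots_t = np.stack(
        [R0 @ so3_exp(t * so3_log(R0.T @ R1)) for R0, R1 in zip(rots0, rots1)]
    )
    return rots_t, trans_t

# Toy usage with 5 random residue frames (assumed values, not from the paper).
rng = np.random.default_rng(0)
rots0 = np.stack([so3_exp(0.5 * rng.standard_normal(3)) for _ in range(5)])
trans0 = rng.standard_normal((5, 3))
rots1, trans1 = perturb_frames(rots0, trans0, sigma_t=0.1, sigma_r=0.1, rng=rng)
rots_mid, trans_mid = interpolate_frames(rots0, trans0, rots1, trans1, t=0.5)
```

Interpolating rotations along the geodesic keeps every intermediate frame a valid rotation matrix, which a naive linear blend of matrix entries would not guarantee. Swapping the roles of the two views gives the reverse direction of the path, which is one plausible reading of "bi-directional" in the caption.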