Internal-Coordinate Density Modelling of Protein Structure: Covariance Matters

Marloes Arts, Jes Frellsen, Wouter Boomsma

TL;DR

The paper addresses the problem of modelling distributions over protein structures in internal coordinates κ, where direct covariance estimation is difficult because of global constraints. It introduces a covariance-inducing strategy that imposes constraints on downstream Cartesian fluctuations via a Lagrange-multiplier framework, yielding a full and tractable covariance structure Σ̃_{κ} in the internal coordinates. The strategy is implemented as a variational autoencoder in which a U‑Net predicts per-atom λ values that shape the constraint-induced covariance, pNeRF converts the κ means to Cartesian coordinates, and samples are drawn by ancestral sampling. The approach is demonstrated in two regimes (unimodal low-data and multimodal high-data), showing meaningful density estimates and competitive performance against baselines, with code and data to be released. The method offers a scalable path toward density modelling of full protein backbones in internal coordinates and could inform future diffusion or hierarchical models.
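
As a rough illustration of the last step described above, the following sketch draws one sample of internal coordinates from a multivariate Gaussian given by a mean and a precision matrix. The form of the precision used here, a diagonal prior plus λ-weighted outer products of Jacobian rows, is an assumption made for illustration and is not spelled out in this summary. The function names (`cholesky`, `induced_precision`, `sample`), the toy Jacobian, and all numeric values are hypothetical. For simplicity each Jacobian row stands for one constrained Cartesian coordinate with its own multiplier, rather than one atom with three coordinates.

```python
import math
import random


def cholesky(a):
    """Return lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def induced_precision(prior_precision, jacobian, lambdas):
    """Assumed form: P = diag(prior_precision) + sum_m lambda_m * g_m g_m^T,
    where g_m is the m-th Jacobian row (d Cartesian coordinate / d kappa)."""
    n = len(prior_precision)
    p = [[prior_precision[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    for row, lam in zip(jacobian, lambdas):
        for i in range(n):
            for j in range(n):
                p[i][j] += lam * row[i] * row[j]
    return p


def sample(mean, precision, rng):
    """Draw x ~ N(mean, precision^{-1}) by solving L^T y = eps with precision = L L^T."""
    n = len(mean)
    low = cholesky(precision)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [0.0] * n
    for i in reversed(range(n)):  # back-substitution on the upper-triangular L^T
        s = eps[i] - sum(low[k][i] * y[k] for k in range(i + 1, n))
        y[i] = s / low[i][i]
    return [m + d for m, d in zip(mean, y)]


# Toy example: three internal coordinates, two constrained Cartesian coordinates.
rng = random.Random(0)
precision = induced_precision(
    prior_precision=[1.0, 1.0, 1.0],
    jacobian=[[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]],
    lambdas=[2.0, 0.5],
)
kappa_sample = sample([0.0, 0.0, 0.0], precision, rng)
```

Working with the Cholesky factor of the precision avoids forming the covariance matrix explicitly. Under the assumed form, a larger λ tightens the distribution along the directions that move the corresponding Cartesian coordinate. A practical implementation would use a linear-algebra library rather than hand-written loops.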

Abstract

After the recent ground-breaking advances in protein structure prediction, one of the remaining challenges in protein machine learning is to reliably predict distributions of structural states. Parametric models of fluctuations are difficult to fit due to complex covariance structures between degrees of freedom in the protein chain, often causing models to either violate local or global structural constraints. In this paper, we present a new strategy for modelling protein densities in internal coordinates, which uses constraints in 3D space to induce covariance structure between the internal degrees of freedom. We illustrate the potential of the procedure by constructing a variational autoencoder with full covariance output induced by the constraints implied by the conditional mean in 3D, and demonstrate that our approach makes it possible to scale density models of internal coordinates to full protein backbones in two settings: 1) a unimodal setting for proteins exhibiting small fluctuations and limited amounts of available data, and 2) a multimodal setting for larger conformational changes in a high data regime.

Paper Structure (40 sections, 10 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: A protein structure ensemble is modelled in internal coordinate space, while imposing constraints on atom fluctuations in Euclidean space. The resulting full covariance structure can be used to sample from a multivariate normal distribution.
  • Figure 2: When a standard estimator is used to get the precision structure over internal coordinates, the resulting atom fluctuations deviate significantly from MD simulations. Blue arrows and red helices represent secondary structural elements. The variance is calculated as the mean of the variances over the x, y and z axes, in Å$^2$.
  • Figure 3: Model overview. The encoder (left) embeds internal coordinates into the latent space. The decoder (right) predicts a mean, from which constraints are extracted to obtain a precision matrix. Together with the $\kappa$-prior over the precision matrix based on the input data, a new precision matrix is formed which can be used to sample from a multivariate Gaussian.
  • Figure 4: Modelling fluctuations in the unimodal setting for 1pga, 1fsd, and 1unc. Left: structure visualization, with $\alpha$-helices in red and $\beta$-sheets as blue arrows. Middle: Ramachandran plots for the MD reference and VAE samples. Right: variance along the atom chain for VAE samples, MD reference, and baselines. Secondary structure elements are indicated along the x-axis.
  • Figure 5: Modelling (un)folding behavior in the multimodal setting for cln025 and 2f4k. Left: structure visualization. Right: TICA free energy landscapes for MD reference, VAE, and baselines.
  • ...and 6 more figures
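
The per-atom variance described in the Figure 2 caption (the mean of the variances over the x, y and z axes) can be sketched as follows. This is a minimal illustration. It assumes the structures in the ensemble are already superposed, and it uses the population variance, neither of which is stated in the caption. The function name `per_atom_variance` and the toy coordinates are hypothetical.

```python
from statistics import pvariance


def per_atom_variance(ensemble):
    """ensemble[s][a] is the (x, y, z) position of atom a in structure s, in Å.
    Returns one value per atom in Å^2: the mean of the x, y and z variances."""
    n_atoms = len(ensemble[0])
    result = []
    for a in range(n_atoms):
        axis_vars = [
            pvariance([frame[a][axis] for frame in ensemble]) for axis in range(3)
        ]
        result.append(sum(axis_vars) / 3.0)
    return result


# Toy ensemble: two structures, two atoms. Atom 0 moves along x, atom 1 is fixed.
toy_ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(2.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
]
variances = per_atom_variance(toy_ensemble)
```

Plotting such values along the atom chain gives a fluctuation profile of the kind the captions compare between model samples and the MD reference.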