Representing local protein environments with atomistic foundation models

Meital Bojan, Sanketh Vedula, Advaith Maddipatla, Nadav Bojan Sellam, Federico Napoli, Paul Schanda, Alex M. Bronstein

TL;DR

This work proposes a novel representation for a local protein environment derived from the intermediate features of atomistic foundation models (AFMs) that effectively captures both local structure and chemical features, and enables a first-of-its-kind physics-informed chemical shift predictor that achieves state-of-the-art accuracy.

Abstract

The local structure of a protein strongly impacts its function and interactions with other molecules. Therefore, a concise, informative representation of a local protein environment is essential for modeling and designing proteins and biomolecular interactions. However, these environments' extensive structural and chemical variability makes them challenging to model, and such representations remain under-explored. In this work, we propose a novel representation for a local protein environment derived from the intermediate features of atomistic foundation models (AFMs). We demonstrate that this embedding effectively captures both local structure (e.g., secondary-structure motifs) and chemical features (e.g., amino-acid identity and protonation state). We further show that the AFM-derived representation space exhibits meaningful structure, enabling the construction of data-driven priors over the distribution of biomolecular environments. Finally, in the context of biomolecular NMR spectroscopy, we demonstrate that the proposed representations enable a first-of-its-kind physics-informed chemical shift predictor that achieves state-of-the-art accuracy. Our results demonstrate the surprising effectiveness of atomistic foundation models and their emergent representations for protein modeling beyond traditional molecular simulations. We believe this will open new lines of work in constructing effective functional representations for protein environments.

Paper Structure

This paper contains 47 sections, 17 equations, 16 figures, 14 tables, 1 algorithm.

Figures (16)

  • Figure 1: Proposed construction of canonical local protein environment descriptors and their use. Machine learning force field (MLFF) models are pre-trained as energy and force regressors on databases of DFT-calculated energies, enabling them to learn zeroth-, first-, and second-order latent representations of interatomic interactions. We embed local protein environments by extracting latent embeddings from pre-trained MLFFs for a focus residue and all atoms within a $5$Å radius of the residue. These embeddings are then mapped onto the atoms of the focus residue to construct canonical environment descriptors, which can be used in downstream models for transfer learning to predict diverse chemical properties.
  • Figure 2: MACE embedding space reveals meaningful structural and chemical information. Depicted are two-dimensional UMAP coordinates of $165,913$ protein environments from $1327$ non-redundant chains predicted by AlphaFold2 (Jumper et al., 2021), labeled left-to-right, top-to-bottom according to the DSSP secondary structure class (see the paper's table of secondary structure codes), amino acid chemical identity, the pair of backbone dihedral angles $(\phi, \psi)$, and CA secondary chemical shift (relative to a random coil).
  • Figure 3: Chemical shift prediction errors for different atom types evaluated on a test set of $132,228$ environments from $203$ non-redundant BMRB records with experimentally determined chemical shifts used as the reference. The median prediction error in ppm and the $25\%$-$75\%$ (boxes) and $5\%$-$95\%$ (whiskers) confidence intervals are depicted.
  • Figure 4: Synthetic example showing the influence of a phenylalanine sidechain aromatic ring on surrounding chemical shifts. A. The magnitude of change of backbone CA chemical shifts over different ring orientations as predicted using the proposed MACE-based predictor (left) and UCBShift-X (right). A $7$Å sphere indicates the radius from the ring center at which the influence of the ring current is expected to become negligible. Note that UCBShift-X predicts much longer-range, albeit small, ring influence extending beyond $20$Å. B. Locations of three nearby CA atoms; and C. their predicted chemical shifts vs. the ring orientation. Note the smooth $180^\circ$-periodic behavior of the MACE shift prediction and the decay of the effect scale with the distance from the ring.
  • Figure 5: Lower likelihood environments result in larger chemical shift prediction error. Chemical shift prediction accuracy of CA (left) and N (right) atoms stratified by the KDE-estimated likelihood of the corresponding MACE descriptors. Depicted are the median and $25\%$-$50\%$ confidence intervals. Higher-likelihood environments correspond to lower prediction error and can be used as an uncertainty measure.
  • ...and 11 more figures
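
The environment selection described in the Figure 1 caption (a focus residue plus all atoms within a $5$Å radius of it) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `atoms_within_radius`, the toy coordinates, and the reading of "within a radius of the residue" as "within the radius of at least one focus-residue atom" are all assumptions made for illustration. Extracting the MLFF embeddings and mapping them onto the focus-residue atoms is not shown.

```python
import math


def atoms_within_radius(coords, residue_ids, focus_residue, radius=5.0):
    """Return sorted indices of the focus-residue atoms plus every atom that
    lies within `radius` angstroms of at least one focus-residue atom.

    Hypothetical helper: the distance criterion is an assumed interpretation
    of the "5 angstrom radius of the residue" in the Figure 1 caption.
    """
    focus = [i for i, res in enumerate(residue_ids) if res == focus_residue]
    if not focus:
        raise ValueError(f"residue {focus_residue!r} not found")

    selected = set(focus)
    for j, xyz in enumerate(coords):
        if j in selected:
            continue
        # Keep atom j if it is close enough to any atom of the focus residue.
        if any(math.dist(xyz, coords[i]) <= radius for i in focus):
            selected.add(j)
    return sorted(selected)


# Toy coordinates in angstroms (made up for illustration, not real protein data).
coords = [
    (0.0, 0.0, 0.0),   # atom 0, residue 1
    (1.5, 0.0, 0.0),   # atom 1, residue 1
    (4.0, 0.0, 0.0),   # atom 2, residue 2: 2.5 angstroms from atom 1
    (10.0, 0.0, 0.0),  # atom 3, residue 3: 8.5 angstroms from atom 1
]
residue_ids = [1, 1, 2, 3]

env = atoms_within_radius(coords, residue_ids, focus_residue=1)
print(env)  # [0, 1, 2]
```

The brute-force loop is quadratic in the number of atoms, which is fine for a toy case; for full protein structures a spatial index such as a k-d tree would be the usual choice.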