Zero-shot protein stability prediction by inverse folding models: a free energy interpretation

Jes Frellsen, Maher M. Kassem, Tone Bengtsen, Lars Olsen, Kresten Lindorff-Larsen, Jesper Ferkinghoff-Borg, Wouter Boomsma

TL;DR

The work tackles the problem of interpreting zero-shot protein stability predictions from inverse folding models through a thermodynamic lens. It derives a formal link between changes in thermodynamic stability, $\beta\Delta\Delta G$, and inverse-folding posteriors, showing how the common log-odds approach emerges under specific approximations. The authors propose multiple refinements, including explicit unfolded-state modeling and multi-structure sampling, and demonstrate that these simple modifications can yield measurable gains across several benchmark datasets. They also present scalable strategies, such as BioEmu, to approximate structural ensembles without expensive simulations. Overall, the paper provides a principled framework to improve zero-shot stability prediction by integrating unfolded-state contributions and ensemble information, with broad implications for protein design and variant interpretation.

Abstract

Inverse folding models have proven to be highly effective zero-shot predictors of protein stability. Despite this success, the link between the amino acid preferences of an inverse folding model and the free-energy considerations underlying thermodynamic stability remains incompletely understood. A better understanding would be of interest not only from a theoretical perspective, but also potentially provide the basis for stronger zero-shot stability prediction. In this paper, we take steps to clarify the free-energy foundations of inverse folding models. Our derivation reveals the standard practice of likelihood ratios as a simplistic approximation and suggests several paths towards better estimates of the relative stability. We empirically assess these approaches and demonstrate that considerable gains in zero-shot performance can be achieved with fairly simple means.
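
The likelihood-ratio practice mentioned in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical array `log_probs` of per-position amino acid log-probabilities, as an inverse folding model might produce for one folded structure. The function name, the alphabet ordering, and the treatment of positions as independent are assumptions made for illustration and are not taken from the paper. How the sign of the score maps onto $\beta\Delta\Delta G$ depends on the convention used for the stability change.

```python
import numpy as np

# Assumed alphabet ordering for the columns of `log_probs` (hypothetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_score(log_probs, wild_type, variant):
    """Log-odds of variant versus wild-type residues, summed over mutated sites.

    log_probs: array of shape (L, 20) with log p(residue | structure) per position
               (a hypothetical model output, conditioned on a single structure).
    wild_type, variant: sequences of equal length L.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    if len(wild_type) != len(variant) or len(wild_type) != log_probs.shape[0]:
        raise ValueError("sequences and log_probs must have the same length")

    score = 0.0
    for i, (wt, mt) in enumerate(zip(wild_type, variant)):
        if wt != mt:  # only mutated positions contribute
            score += log_probs[i, AMINO_ACIDS.index(mt)]
            score -= log_probs[i, AMINO_ACIDS.index(wt)]
    return float(score)
```

A positive score means the model assigns higher probability to the variant residues than to the wild-type residues at the mutated positions. This single-structure score ignores the unfolded state and structural ensembles that the paper discusses.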

Paper Structure

This paper contains 38 sections, 30 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Correlation coefficients obtained using the different expressions discussed in the paper, all involving the inverse-folding model ESM-IF. The top-left expression is the approach typically employed as a zero-shot predictor of protein stability. The left column contains methods that consider only a single folded structure, while the right column contains methods that consider a structural ensemble from an MD simulation. The different rows represent increasingly accurate approximations to $\beta\Delta\Delta G$ (see text for details). The error bars represent the standard error of the mean calculated using 100 bootstrap samples. For simplicity, we use the bracket notation $\langle \cdot \rangle_S$ to denote the expectation $\mathbb{E}_{\bm{x} \sim p_\theta(\bm{x} | S, \bm{a},\beta)} [\cdot]$.
  • Figure 2: Scaling using BioEmu [lewis2024scalable]. When replacing MD simulations of the folded state with structural ensembles generated using BioEmu (20 samples), we still observe consistent improvements over the single-structure performance on this subset of the mega-scale data set [tsuboyama2023mega]. This suggests that learned generators of molecular ensembles offer a promising route to scaling our ensemble-based approach in practice.
  • Figure 3: Stability change calculated from the two pseudo–free-energy terms, $\beta\Delta \tilde{G}_{\bm{a}' \to \bm{a}}^{S}$, plotted as a function of $p(\mathrm{F} | \bm{a}',\beta)$ for $p(\mathrm{F} | \bm{a},\beta) = 0.95$. As discussed in Section \ref{sec:simplified}, we observe that when $p(\mathrm{F} | \bm{a}',\beta)$ is around $0.95$, the unfolded-state term, $\beta \Delta \tilde{G}_{\bm{a}' \to \bm{a}}^{\mathrm{U}}$, provides the dominant contribution to the stability change, $\beta\Delta\Delta G_{\bm{a} \to \bm{a}'}$, whereas for $p(\mathrm{F} | \bm{a}', \beta) < p(\mathrm{U} | \bm{a}, \beta)$, the folded-state term, $\beta \Delta \tilde{G}_{\bm{a}' \to \bm{a}}^{\mathrm{F}}$, dominates. We also observe that the stability change, $\beta\Delta\Delta G_{\bm{a} \to \bm{a}'}$, is a monotonic function of the variant's folding probability, $p(\mathrm{F} | \bm{a}',\beta)$, as discussed in Sections \ref{sec:ranking} and \ref{sec:ranking details}.
  • Figure 4: A breakdown of the performance for the individual proteins within the three datasets. Since correlations are computed, only proteins with at least 20 variant observations are included. The top-left variant is the approach typically employed as a zero-shot predictor of protein stability. The left column contains methods that consider only a single folded structure, while the right column contains methods that consider a structural ensemble from an MD simulation. Note the considerable variation among the proteins in the Guerois set.
  • Figure 5: Pearson and Spearman correlations behave similarly. As expected from our derivations, the relationship between zero-shot scores and stability is linear, so a rank-based measure such as Spearman's $\rho$ is not necessary.
  • ...and 1 more figure
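
The captions above report correlation coefficients with bootstrap error bars (100 bootstrap samples) and compare Pearson with Spearman correlations. A minimal sketch of such a computation follows. It assumes that individual variants are resampled with replacement, which the captions do not specify; the function names are hypothetical, and the rank transform breaks ties by order rather than averaging them.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks.

    Ties are broken by order rather than averaged (a simplification).
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


def bootstrap_se(x, y, stat=pearson, n_boot=100, seed=0):
    """Standard error of `stat`, estimated from `n_boot` bootstrap resamples.

    Assumption: (score, measurement) pairs are resampled with replacement.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        estimates[b] = stat(x[idx], y[idx])
    return float(np.std(estimates, ddof=1))
```

With zero-shot scores in `x` and measured stability changes in `y`, `pearson(x, y)` gives the point estimate and `bootstrap_se(x, y)` gives the half-width of a one-standard-error bar.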