Kermut: Composite kernel regression for protein variant effects

Peter Mørch Groth; Mads Herbert Kerrn; Lars Olsen; Jesper Salomon; Wouter Boomsma

TL;DR

A Gaussian process regression model, Kermut, with a novel composite kernel for modeling mutation similarity, which obtains state-of-the-art performance for supervised protein variant effect prediction while also offering estimates of uncertainty through its posterior.

Abstract

Reliable prediction of protein variant effects is crucial both for protein optimization and for advancing biological understanding. For practical use in protein engineering, predictions must also come with reliable uncertainty estimates; yet while prediction accuracy has seen much progress in recent years, uncertainty metrics are rarely reported. Here we present a Gaussian process regression model, Kermut, with a novel composite kernel for modeling mutation similarity, which obtains state-of-the-art performance for supervised protein variant effect prediction while also offering estimates of uncertainty through its posterior. An analysis of the quality of the uncertainty estimates demonstrates that our model provides meaningful levels of overall calibration, but that instance-specific uncertainty calibration remains more challenging.
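
To illustrate what "estimates of uncertainty through its posterior" means for a Gaussian process, the following is a minimal sketch of exact GP regression. It is not Kermut's implementation: the scalar inputs, the squared-exponential kernel `rbf`, the zero mean function, the noise level, and all function names are assumptions chosen only to keep the example self-contained. The point is that the same linear algebra yields both a predictive mean and a predictive variance for every test point.

```python
import math

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on scalars (a stand-in, not Kermut's kernel)."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(x_train, y_train, x_test, kernel=rbf, noise=1e-2):
    """Posterior mean and latent-function variance of a zero-mean GP."""
    n = len(x_train)
    # Training covariance with observation noise added on the diagonal.
    K = [[kernel(x_train[i], x_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y_train))  # alpha = K^{-1} y
    means, variances = [], []
    for xs in x_test:
        k_star = [kernel(xs, xt) for xt in x_train]
        v = solve_lower(L, k_star)
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        # Prior variance minus the part explained by the training data.
        variances.append(kernel(xs, xs) - sum(vi * vi for vi in v))
    return means, variances

# Toy usage: one test point at a training input, one far from all of them.
means, variances = gp_posterior([0.0, 1.0, 2.0], [0.0, 0.8, 0.9], [1.0, 5.0])
```

In this toy setup the variance at the test point far from the training inputs is larger than at the point that coincides with a training input, which is the qualitative behavior one wants from an uncertainty estimate. Whether those variances are well calibrated is a separate question, which the abstract addresses.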

Paper Structure

This paper contains 69 sections, 11 equations, 23 figures, 18 tables.

Introduction
Related work
Protein property prediction
Kernel methods for protein sequences
Uncertainty quantification and calibration
Local structural environments
Methods
Preliminaries
Gaussian processes
Kermut
Zero-shot mean function
Architecture considerations
Results
Ablation study
Uncertainty quantification per mutation domain
...and 54 more sections

Figures (23)

Figure 1: Overview of Kermut's structure kernel. Using an inverse folding model, structure-conditioned amino acid distributions are computed for all sites in the reference protein. The structure kernel yields high covariances between two variants if the local environments are similar, if the mutation probabilities are similar, and if the mutated sites are physically close. Constructed examples of expected covariances between variant $\bm{x}_1$ and $\bm{x}_{2,3,4}$ are shown.
Figure 2: Distribution of predictive variances for datasets with double mutants, grouped by domain. The first three elements correspond to the three split schemes from ProteinGym. The fourth and fifth correspond to training on both single and double mutants, and testing on each, respectively. For the last element, we train on single and test on double mutants, corresponding to an extrapolation setting.
Figure 3: Calibration curves for Kermut using different methods. Mean ECE/ENCE values ($\pm2\sigma$) are shown. Dashed line ($x=y$) corresponds to ideal calibration. The row order corresponds to the ordering in \ref{tab:subsets}. (a) exhibits good calibration as indicated by curves close to the diagonal and ECE values close to zero, albeit with under-confident uncertainties in the second row. In (b), Kermut is also relatively well-calibrated, as indicated by the increasing curves, albeit with large variances along both axes. The low coefficients of variation ($c_v$) indicate similar predictive variances in each setting. Overall, Kermut achieves good calibration in most cases as a result of the designed kernel.
Figure I.1: Histogram over normalized assay values for 51/69 datasets with multi-mutants. Datasets with more than 7500 variants are excluded. The histograms are colored according to the number of mutations per variant. The assay distributions belong to different modalities depending on the number of mutations present, where double mutations commonly lead to a loss of fitness.
Figure J.1: Calibration metrics per domain for Kermut and the sequence kernel on ESM-2 embeddings. Random, modulo, and contiguous domains are from the ProteinGym substitution benchmark. Multiples corresponds to training and testing on both single and double mutants. Extrapolation corresponds to training on singles and predicting on doubles. For comparability, the same 51 datasets with multi-mutants were used for all domains. The performance results for the multi-mutant setting can be found in \ref{tab:multimutants}. Error bars correspond to standard error.
...and 18 more figures
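
The three conditions named in the Figure 1 caption (similar local environments, similar mutation probabilities, physically close sites) can be sketched as a product of three similarity terms. The functional forms below are assumptions made for illustration and are not taken from the paper: a Hellinger distance between site-wise amino acid distributions, an absolute difference of log-probabilities, a Euclidean distance between site coordinates, and exponential decay with hypothetical rate parameters `gamma_env`, `gamma_prob`, and `gamma_dist`.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def structure_kernel(site_dist_1, site_dist_2, log_prob_1, log_prob_2,
                     coord_1, coord_2,
                     gamma_env=1.0, gamma_prob=1.0, gamma_dist=0.1):
    """Hypothetical product of three similarity terms for two single mutants.

    site_dist_*: amino acid distribution at the mutated site (e.g. from an
                 inverse folding model), as a list of probabilities.
    log_prob_*:  log-probability of the introduced amino acid at that site.
    coord_*:     3D coordinates of the mutated site.
    """
    k_env = math.exp(-gamma_env * hellinger(site_dist_1, site_dist_2))
    k_prob = math.exp(-gamma_prob * abs(log_prob_1 - log_prob_2))
    k_dist = math.exp(-gamma_dist * math.dist(coord_1, coord_2))
    # The product is high only when all three similarities are high.
    return k_env * k_prob * k_dist

# Toy usage with a three-letter alphabet (made-up numbers, not real data).
env_a = [0.7, 0.2, 0.1]
env_b = [0.6, 0.3, 0.1]   # similar environment to env_a
env_c = [0.1, 0.1, 0.8]   # dissimilar environment
k_similar_close = structure_kernel(env_a, env_b, math.log(0.2), math.log(0.3),
                                   (0.0, 0.0, 0.0), (3.0, 0.0, 0.0))
k_dissimilar_far = structure_kernel(env_a, env_c, math.log(0.2), math.log(0.8),
                                    (0.0, 0.0, 0.0), (30.0, 0.0, 0.0))
```

Under these assumed forms, each factor lies in (0, 1] and equals 1 only when its two inputs coincide, so a variant pair that differs strongly in any one of the three respects receives a low covariance.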
