Data-Error Scaling Laws in Machine Learning on Combinatorial Mutation-prone Sets: Proteins and Small Molecules

Vanni Doffini, O. Anatole von Lilienfeld, Michael A. Nash

TL;DR

Problem: how data scarcity and discreteness in mutational spaces shape ML data-error scaling. Approach: a unified workflow applying kernel ridge regression with a Laplacian kernel to synthetic fitness functions, EvoEF binding energies, solvation energies, and a GB1 deep mutational scan, with multiple encodings and two shuffling schemes, plus a new LC normalization. Contributions: (i) identification of phase-transition-like, discontinuous learning curves featuring saturated and asymptotic decay regimes; (ii) demonstration that encoding and shuffling control LC shape and cluster structure; (iii) validation on experimental GB1 data showing similar behavior under mutant-based shuffling; (iv) practical guidance for design-of-experiments in mutational studies. Significance: informs efficient mutational library design and extends statistical learning theory for discrete combinatorial inputs.

Abstract

We investigate trends in the data-error scaling laws of machine learning (ML) models trained on discrete combinatorial spaces that are prone to mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computational and experimental training data. Our synthetic datasets comprised i) two naïve functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two 6-heavy-atom structural graphs, while the experimental dataset consisted of a full deep mutational scan of the binding protein GB1. In contrast to typical data-error scaling laws, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes, which we call saturated and asymptotic decay, and found that they are conditioned by the level of complexity (i.e. number of mutations) enclosed in the training set. We show that during training on this class of problems, the predictions of the ML models employed were clustered in the calibration plots. Furthermore, we present an alternative strategy to normalize learning curves (LCs) and introduce the concept of mutant-based shuffling. This work has implications for machine learning on mutagenisable discrete spaces such as chemical properties or protein phenotype prediction, and improves basic understanding of concepts in statistical learning theory.
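The setup described in the abstract can be illustrated with a minimal sketch: kernel ridge regression with a Laplacian kernel on flattened one-hot encodings, evaluated at increasing training-set sizes. Everything specific here is an assumption made for illustration and is not taken from the paper: the 4-letter vocabulary, the hypothetical wild type `AAAA`, the random additive (1-body-like) toy fitness function, the fixed `sigma` and `lam`, and the reading of mutant-based shuffling as "order the training pool by number of mutations from the wild type, shuffled within each level".

```python
import itertools
import numpy as np

ALPHABET = "ACDE"   # toy 4-letter vocabulary (assumption; not the paper's setup)
WT = "AAAA"         # hypothetical wild-type sequence

def one_hot(seq):
    """Flattened binary one-hot encoding of a sequence."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, ch in enumerate(seq):
        x[i, ALPHABET.index(ch)] = 1.0
    return x.ravel()

def laplacian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||_1 / sigma) for all row pairs of A and B."""
    d = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-d / sigma)

def krr_fit_predict(X_tr, y_tr, X_te, sigma=4.0, lam=1e-8):
    """Solve (K + lam*I) alpha = y, then predict on X_te."""
    K = laplacian_kernel(X_tr, X_tr, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr)
    return laplacian_kernel(X_te, X_tr, sigma) @ alpha

def n_mut(seq):
    """Number of positions that differ from the wild type."""
    return sum(a != b for a, b in zip(seq, WT))

def learning_curve(sizes=(1, 13, 50, 192), seed=0):
    rng = np.random.default_rng(seed)
    seqs = ["".join(p) for p in itertools.product(ALPHABET, repeat=len(WT))]
    X = np.array([one_hot(s) for s in seqs])
    # Toy additive fitness: one random weight per (position, letter) pair.
    w = rng.normal(size=(len(WT), len(ALPHABET)))
    y = np.array([sum(w[i, ALPHABET.index(c)] for i, c in enumerate(s))
                  for s in seqs])
    # Hold out 64 random variants for testing; the WT (index 0) stays in the pool.
    idx = rng.permutation(np.arange(1, len(seqs)))
    test = idx[:64]
    pool = [0] + list(idx[64:])
    # Assumed mutant-based shuffling: sort by mutation count, random within a level.
    pool = sorted(pool, key=lambda i: (n_mut(seqs[i]), rng.random()))
    curve = []
    for n in sizes:
        train = pool[:n]
        pred = krr_fit_predict(X[train], y[train], X[test])
        curve.append((n, float(np.mean(np.abs(pred - y[test])))))
    return curve

if __name__ == "__main__":
    for n, mae in learning_curve():
        print(f"N = {n:4d}   test MAE = {mae:.4f}")
```

Plotting the resulting test error against N on log-log axes gives a toy learning curve. Whether this toy reproduces the discontinuous drops reported in the paper depends on the fitness function and hyperparameters chosen, so it should be read as a scaffold for experimentation only.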

Paper Structure

This paper contains 17 sections, 16 equations, 21 figures, 3 tables, 1 algorithm.

Figures (21)

  • Figure 1: Size of cumulative combinatorial space for a linear graph (e.g., a peptide) and ideal LC. (A) The number of cumulative combinations ($D_{tot}^{cont}$) is shown while changing the length of the linear graph ($n$, number of amino acids in chain) and the number of mutations in the sequence ($\hat{m}$). The vocabulary size was kept constant (20). (B) Example of an idealized LC (i.e., following a single power-law) for a classic ML problem with non-mutation-prone inputs. The effect of homoscedastic noise (dashed line) on the ideal behavior is shown: dotted line = no noise; solid line = with noise. Numerical values on the axes are intentionally omitted to highlight the illustrative and idealized nature of this subfigure.
  • Figure 2: Workflow overview. (A) Database Generation: a table containing all possible mutagenized peptide variants was generated from a starting construct (WT) and a mutational vocabulary. The response variable (binding energy) was computed for each entry. (B) Encoding: the database was converted into a matrix containing numerical values using binary flattened one hot encoding. (C) Machine learning: Laplacian kernel machines were trained using different quantities of data and different shuffling strategies. (D) Evaluation: LCs and calibration plots were used to study the learning process. The amount of information used during training in the scatter plots is reported on the figure.
  • Figure 3: LCs for different discretized spaces and functions. (A) 1-body (left) and 2-body (right) naïve functions. (B) Fg-$\beta$ (red) / S. epidermidis adhesin SdrG (greyscale) complex. (C) Binding energy function results using different shuffling strategies. (D) Solvation energy results for different structures (6 heavy atoms; linear, cyclic, or the combination of both). The shuffling strategies are specified when relevant. If not explicitly specified, mutant-based shuffling is applied to the whole dataset. The presence of WT in the training set is reported in red only if the shuffling strategy did not automatically set it as the first entry.
  • Figure 4: Calibration plots of Fg-$\beta$ / S. epidermidis adhesin SdrG complex binding energy function. (A) Scatter plots showing true ($y$) vs. predicted ($\hat{y}$) energies at different numbers of mutations ($m$, first row) and training instances ($N_{training}^{norm}$, second row). Insets: zoom-in. (B) Rotation of the predictions from an ML model trained with the WT and all single mutants and tested on all quintuple mutants (shown in panel A, 1st row, 2nd column). Insets: amino acid frequencies according to their cluster positions.
  • Figure 5: Impact of WT sequence being included in the training data on learning Fg-$\beta$ / SdrG complex binding energy function (EvoEF). (A) Extended LCs (hyperparameter, $\sigma$) of single examples using different shuffling strategies (see Fig. 2). The dashed red lines mark the point at which the WT sequence was included in the training set. Insets: standard LCs of the specific replicates analysed. (B) 3D projection of the plots shown in panel A.
  • ...and 16 more figures
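
The cumulative combinatorial space in Figure 1A can be illustrated with a short counting sketch. The caption does not state the formula behind $D_{tot}^{cont}$, so the closed form below is an assumption: it counts every sequence of length `n` that differs from a fixed reference in at most `m_max` positions, where each mutated position can take any of the `vocab - 1` other letters. The function name and parameters are hypothetical.

```python
from math import comb

def cumulative_variants(n, m_max, vocab=20):
    """Count sequences of length n with at most m_max substitutions
    relative to a fixed reference (assumed counting rule, see lead-in)."""
    if not 0 <= m_max <= n:
        raise ValueError("m_max must satisfy 0 <= m_max <= n")
    # Choose which k positions mutate, then one of (vocab - 1) letters for each.
    return sum(comb(n, k) * (vocab - 1) ** k for k in range(m_max + 1))

if __name__ == "__main__":
    for m in range(6):
        print(m, cumulative_variants(5, m))
```

Under this counting rule, allowing all `n` positions to mutate recovers the full space of `vocab ** n` sequences (by the binomial theorem), which makes a convenient sanity check.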