
Contrastive losses as generalized models of global epistasis

David H. Brookes, Jakub Otwinowski, Sam Sinai

TL;DR

This work argues by way of a fitness-epistasis uncertainty principle that the nonlinearities in global epistasis models can produce observed fitness functions that do not admit sparse representations, and thus may be inefficient to learn from observations when using a Mean Squared Error (MSE) loss.

Abstract

Fitness functions map large combinatorial spaces of biological sequences to properties of interest. Inferring these multimodal functions from experimental data is a central task in modern protein engineering. Global epistasis models are an effective and physically-grounded class of models for estimating fitness functions from observed data. These models assume that a sparse latent function is transformed by a monotonic nonlinearity to emit measurable fitness. Here we demonstrate that minimizing supervised contrastive loss functions, such as the Bradley-Terry loss, is a simple and flexible technique for extracting the sparse latent function implied by global epistasis. We argue by way of a fitness-epistasis uncertainty principle that the nonlinearities in global epistasis models can produce observed fitness functions that do not admit sparse representations, and thus may be inefficient to learn from observations when using a Mean Squared Error (MSE) loss (a common practice). We show that contrastive losses are able to accurately estimate a ranking function from limited data even in regimes where MSE is ineffective and validate the practical utility of this insight by demonstrating that contrastive loss functions result in consistently improved performance on benchmark tasks.
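As an illustration of why a contrastive loss suits this setting, the following is a minimal sketch of a pairwise Bradley-Terry loss in plain Python. The function names, the toy latent values, and the choice to average over pairs are assumptions made for this example, not the authors' implementation. The sketch shows that the loss depends on the observed fitness only through its ranking, so a monotonic nonlinearity such as $g(f)=\exp(10\cdot f)$ leaves it unchanged, whereas the MSE against the same observations changes substantially.

```python
import math


def bradley_terry_loss(scores, observed):
    """Mean of log(1 + exp(-(s_i - s_j))) over pairs with observed[i] > observed[j]."""
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if observed[i] > observed[j]:
                diff = scores[i] - scores[j]
                # Numerically stable evaluation of log(1 + exp(-diff)).
                total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def mse_loss(scores, observed):
    """Mean squared error between model scores and observed fitness."""
    return sum((s - y) ** 2 for s, y in zip(scores, observed)) / len(scores)


# Hypothetical latent fitness values for five sequences (illustrative only).
latent = [0.1, 0.4, 0.35, 0.8, 0.6]
# Observed fitness after a monotonic global epistasis nonlinearity.
observed = [math.exp(10.0 * f) for f in latent]
# A model whose scores equal the latent function exactly.
scores = list(latent)

bt_on_latent = bradley_terry_loss(scores, latent)
bt_on_observed = bradley_terry_loss(scores, observed)
mse_on_latent = mse_loss(scores, latent)
mse_on_observed = mse_loss(scores, observed)

print("BT loss vs latent:  ", bt_on_latent)
print("BT loss vs observed:", bt_on_observed)
print("MSE vs latent:      ", mse_on_latent)
print("MSE vs observed:    ", mse_on_observed)
```

Because only the ordering of `observed` enters the pair selection, the two Bradley-Terry values coincide, while the MSE penalizes the same model heavily once the nonlinearity is applied.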

Paper Structure (25 sections, 4 theorems, 17 equations, 8 figures, 2 tables)


Key Result

Theorem 1

Assume without loss of generality that $p$ is sorted such that $p_i \geq p_{i+1}$ for $i=1,...,N-1$. Additionally assume that $q_i = 0$ if $p_i=0$ for all $i$. If $w_i \geq w_{i+1}$ for all $i=1,...,N-1$, then $H(g(\textbf{f})) \leq H(\textbf{f})$.

Figures (8)

  • Figure 1: Recovery of latent fitness function from complete fitness data by minimizing Bradley-Terry loss. (a) Schematic of simulation. (b) Comparison between latent ($f$) and observed ($y$) fitness functions in fitness (left) and epistatic (right) domains. The latent fitness function is sampled from the NK model with $L=8$ and $K=2$ and the global epistasis function is $g(f)=\exp(10\cdot f)$. Each point in the scatter plot represents the fitness of a sequence, while each bar in the bar plot (right) represents the squared magnitude of an epistatic interaction (normalized such that all squared magnitudes sum to 1), with roman numerals indicating the order of interaction. Only epistatic interactions up to order 3 are shown. The right plot demonstrates that global epistasis produces a dense representation in the epistatic domain compared to the representation of the latent fitness in the epistatic domain. (c) Comparison between latent and estimated ($\hat{f}$) fitness functions in fitness and epistatic domains.
  • Figure 2: (a) Demonstration of the fitness-epistasis uncertainty principle for a latent fitness function transformed by $g(f) = \exp(a\cdot f)$ for various settings of $a$. The dashed black line indicates the lower bound on the sum of the entropies of the fitness and epistatic representations of the fitness function. (b) Test-set Spearman correlation for models trained with MSE and BT losses on incomplete fitness data transformed by various nonlinearities, compared to the entropy of the fitness function in the epistatic domain. Each point corresponds to a model trained on 256 randomly sampled training points from an $L=10, K=2$ latent fitness function which was then transformed by a nonlinearity. (c) Convergence of models fit with BT and MSE losses to observed data generated by transforming an $L=10, K=2$ latent fitness function by $g(f) = \exp(10\cdot f)$. Each point represents the mean test-set correlation over 200 training set replicates.
  • Figure 3: Results from multiple examples of the task of recovering a latent fitness function given complete observed data transformed by a global epistasis nonlinearity. Each sub-plot shows the results of one such task. The setting of $K$ used to sample the latent fitness function from the NK model and the particular form of the nonlinearity $g(f)$ used are indicated in each sub-plot title. The horizontal axis in each sub-plot represents the values of the latent fitness function, while the vertical axis represents the values of either the observed data (blue dots) or model predictions (red dots). For ease of plotting, all fitness functions were normalized to have an empirical mean and std. dev. of 0 and 1, respectively. The $R^2$ correlation between the latent fitness function and the model predictions is indicated in red text.
  • Figure 4: Demonstration of the fitness-epistasis uncertainty principle for multiple examples of nonlinearities. The title of the subplot indicates the nonlinearity used to produce the results in that subplot. The lines and shaded regions represent the mean and std. dev. of entropies, respectively, across 200 replicates of latent fitness functions sampled from the NK model. The black dotted line indicates the lower bound on the sum of the entropies given by the fitness-epistasis uncertainty principle.
  • Figure 5: Results from incomplete data simulations for latent fitness functions drawn from the NK model with (a) $K=1$ and (b) $K=2$. Plot descriptions are as in Figure 2b.
  • ...and 3 more figures
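
The quantities plotted in these figures can be illustrated with a small sketch. It rests on several assumptions: sequences are binary, so the epistatic representation is taken to be the orthonormal Walsh-Hadamard transform; the entropy of a vector is the Shannon entropy of its normalized squared magnitudes; and the lower bound is $L \log 2$, which is the standard entropic uncertainty relation for this basis rather than a formula quoted from the paper. The latent function here is a hypothetical random sparse function, not a sample from the NK model.

```python
import math
import random


def walsh_hadamard(values):
    """Orthonormal Walsh-Hadamard transform of a list of length 2^L."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for k in range(start, start + h):
                a, b = out[k], out[k + h]
                out[k], out[k + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def entropy(values):
    """Shannon entropy (in nats) of the normalized squared magnitudes."""
    total = sum(v * v for v in values)
    probs = [v * v / total for v in values]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


L = 8                 # sequence length (binary alphabet assumed)
N = 2 ** L            # number of sequences
N_NONZERO = 10        # hypothetical sparsity of the latent function
rng = random.Random(0)

# Sparse latent function: a handful of nonzero epistatic coefficients.
beta = [0.0] * N
for idx in rng.sample(range(N), N_NONZERO):
    beta[idx] = rng.gauss(0.0, 1.0)
# The orthonormal transform is its own inverse, so this maps coefficients to fitness.
latent = walsh_hadamard(beta)

# Observed fitness under the nonlinearity g(f) = exp(10 * f).
observed = [math.exp(10.0 * f) for f in latent]

bound = L * math.log(2.0)
results = {}
for name, f in [("latent", latent), ("observed", observed)]:
    h_fitness = entropy(f)
    h_epistatic = entropy(walsh_hadamard(f))
    results[name] = (h_fitness, h_epistatic)
    print(f"{name}: H(fitness)={h_fitness:.3f}  H(epistatic)={h_epistatic:.3f}  "
          f"sum={h_fitness + h_epistatic:.3f}  bound={bound:.3f}")
```

Under these assumptions the sum of the two entropies cannot fall below the bound, so a nonlinearity that concentrates the fitness representation forces the epistatic representation to spread out.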

Theorems & Definitions (8)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Corollary 1
  • Proof of Theorem 1
  • Proof of Lemma 1
  • Proof of Lemma 2
  • Proof of Corollary 1