Improving Protein Optimization with Smoothed Fitness Landscapes

Andrew Kirjner; Jason Yim; Raman Samusevich; Shahar Bracha; Tommi Jaakkola; Regina Barzilay; Ila Fiete

TL;DR

This work formulates protein fitness as a graph signal, then uses Tikhonov regularization to smooth the fitness landscape. Optimizing in the smoothed landscape improves performance across multiple methods on the GFP and AAV benchmarks.

Abstract

The ability to engineer novel proteins with higher fitness for a desired property would be revolutionary for biotechnology and medicine. Modeling the combinatorially large space of sequences is infeasible; prior methods often constrain optimization to a small mutational radius, but this drastically limits the design space. Instead of heuristics, we propose smoothing the fitness landscape to facilitate protein optimization. First, we formulate protein fitness as a graph signal, then use Tikhonov regularization to smooth the fitness landscape. We find that optimizing in this smoothed landscape leads to improved performance across multiple methods in the GFP and AAV benchmarks. Second, we achieve state-of-the-art results utilizing discrete energy-based models and MCMC in the smoothed landscape. Our method, called Gibbs sampling with Graph-based Smoothing (GGS), demonstrates a unique ability to achieve a 2.5-fold fitness improvement (with in-silico evaluation) over its training set. GGS demonstrates potential to optimize proteins in the limited data regime. Code: https://github.com/kirjner/GGS
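As a rough illustration of the smoothing idea in the abstract, the sketch below applies textbook Tikhonov regularization to a graph signal: it minimizes $\|y - \hat{y}\|^2 + \gamma\, \hat{y}^\top L \hat{y}$, whose closed-form solution is $\hat{y} = (I + \gamma L)^{-1} y$ with $L$ the graph Laplacian. The graph used here (edges between sequences at Hamming distance 1), the toy length-2 sequences over $\{A, B\}$, the fitness values, and the names `tikhonov_smooth` and `gamma` are all assumptions made for illustration; the paper's actual graph construction and hyperparameters are not given in this summary.

```python
import itertools
import numpy as np


def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def tikhonov_smooth(seqs, fitness, gamma=1.0):
    """Smooth a fitness signal defined on a sequence graph.

    Assumption: nodes are sequences, and two nodes share an edge when they
    differ by exactly one point mutation (Hamming distance 1).
    """
    n = len(seqs)
    adjacency = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        if hamming(seqs[i], seqs[j]) == 1:
            adjacency[i, j] = adjacency[j, i] = 1.0
    # Combinatorial graph Laplacian: L = D - A.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    y = np.asarray(fitness, dtype=float)
    # Closed-form minimizer: solve (I + gamma * L) y_hat = y.
    return np.linalg.solve(np.eye(n) + gamma * laplacian, y)


# Toy data (made up): all length-2 sequences over the vocabulary {A, B}.
seqs = ["AA", "AB", "BA", "BB"]
noisy = [0.0, 1.0, 0.2, 0.9]
smoothed = tikhonov_smooth(seqs, noisy, gamma=1.0)
print(smoothed)
```

Larger values of `gamma` pull neighboring sequences toward similar fitness, while `gamma=0` returns the input unchanged. The total fitness over the graph is preserved because the Laplacian annihilates constant signals.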

Paper Structure (28 sections, 8 equations, 5 figures, 11 tables, 5 algorithms)

Introduction
Related work
Method
Problem formulation
Graph-based smoothing on proteins
Sampling improved fitness with Gibbs
Clustered sampling.
Experiments
Benchmark
In-silico evaluation.
Results
Analysis
Graph size.
Smoothing.
Sampling convergence.
...and 13 more sections

Figures (5)

Figure 1: Overview. (A) Protein optimization is challenging due to a noisy fitness landscape where the starting dataset (unblurred) is a fraction of the landscape with the highest fitness sequences hidden (blurred). (B) We develop Graph-based Smoothing (GS) to estimate a smoothed fitness landscape from the starting data. (C) A model is trained on the smoothed fitness landscape to infer the rest of the landscape. (D) Gradients from the model are used in Gibbs With Gradients (GWG) where on each step a new mutation is proposed. (E) The goal of sampling is for each trajectory to gradually head towards higher fitness.
Figure 2: Steps in graph-based smoothing on proteins, illustrated with fictitious data of length-2 sequences with vocabulary $\{A, B\}$. Above each node are the corresponding fitness values. Solid nodes are those in our training set, while dashed nodes are augmented via point mutations to increase the smoothing effectiveness. See \ref{sec:smoothing} for a description of each step.
Figure 3: GGS hyperparameter analysis on GFP and AAV hard difficulty. See \ref{sec:analysis}.
Figure 4: Easy is taken from design-bench, where sequences between the 50th and 60th percentiles are used in training regardless of edit distance to sequences in the 99th percentile. Data leakage is present because multiple measurements allow the wild-type and other top sequences to be included during training. Medium filters the training dataset to sequences in the 20th-40th percentile that are 6 or more mutations away from anything in the 99th percentile. Hard similarly filters for sequences in at most the 30th percentile that are 7 or more mutations away.
Figure 5: Illustration of clustered sampling. $\tilde{V}_r$ is the starting set of sequences for sampling in round $r$. GWG (\ref{alg:big}) is run to generate many sample sequences, $V_{r+1}$. To control computation, we hierarchically cluster all sampled sequences based on Levenshtein distance and take the top fitness sequence in each cluster, using our trained fitness prediction model $f_\theta$ to score each sequence. We refer to this subroutine as Reduce (\ref{eq:reduce}). The top sequences, $\tilde{V}_{r+1}$, are used for the next round.
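
The Reduce step described in the Figure 5 caption can be sketched roughly as follows: cluster sampled sequences hierarchically by Levenshtein distance, then keep the highest-scoring sequence from each cluster. This is a minimal pure-Python sketch, not the authors' implementation. The average-linkage criterion, the `n_clusters` parameter, the function names, and the toy scoring function standing in for the trained model $f_\theta$ are all assumptions, since the caption does not specify them.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def reduce_samples(seqs, score_fn, n_clusters):
    """Agglomeratively cluster sequences, then keep the best one per cluster.

    Assumptions: average linkage and a fixed target number of clusters.
    """
    clusters = [[s] for s in seqs]

    def linkage(c1, c2):
        # Average pairwise Levenshtein distance between two clusters.
        total = sum(levenshtein(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    # Score each sequence and keep the top one in every cluster.
    return [max(cluster, key=score_fn) for cluster in clusters]


# Toy stand-in for the trained fitness model: fraction of 'B' residues.
def toy_score(seq):
    return seq.count("B") / len(seq)


samples = ["AAAA", "AAAB", "AABA", "BBBB", "BBBA"]
kept = reduce_samples(samples, toy_score, n_clusters=2)
print(kept)
```

The naive merge loop recomputes all pairwise linkages each round, which is fine for a handful of sequences but would need a cached distance matrix or a library clustering routine for the large sample sets the caption mentions.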
