How to make the most of your masked language model for protein engineering

Calvin McCarter, Nick Bhattacharya, Sebastian W. Ober, Hunter Elliott

TL;DR

This work proposes a flexible, effective sampling method for masked language models (MLMs) and reports results from an extensive in vitro head-to-head evaluation in the antibody engineering setting. The evaluation reveals that the choice of sampling method is at least as impactful as the choice of model, motivating future research into this under-explored area.

Abstract

A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.
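
To make the idea of searching over 1-edit neighborhoods concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a simplified form of stochastic beam search in which, at each step, all single-substitution neighbors of the current beam are pooled, their scores are perturbed with Gumbel noise, and the top `beam_width` are kept (equivalent to sampling without replacement from a softmax over scores). The scorer `toy_score` is a hypothetical stand-in for an MLM pseudo-log-likelihood, and all function and parameter names are illustrative.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy scorer: a fixed per-residue log-probability table.
# A real setup would score sequences with an MLM instead (assumption).
_WEIGHTS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
_TOTAL = sum(_WEIGHTS.values())
_LOGP = {aa: math.log(w / _TOTAL) for aa, w in _WEIGHTS.items()}


def toy_score(seq):
    """Sum of per-residue log-probabilities (higher is better)."""
    return sum(_LOGP[aa] for aa in seq)


def one_edit_neighbors(seq):
    """Yield every sequence that differs from seq by one substitution."""
    for i, old in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != old:
                yield seq[:i] + aa + seq[i + 1:]


def gumbel(rng):
    """Draw one standard Gumbel sample, guarding against log(0)."""
    u = max(rng.random(), 1e-12)
    return -math.log(-math.log(u))


def stochastic_beam_search(seed, score_fn, beam_width=4, n_steps=3,
                           temperature=1.0, rng=None):
    """Simplified stochastic beam search over 1-edit neighborhoods."""
    if rng is None:
        rng = random.Random(0)
    beam = [seed]
    for _ in range(n_steps):
        # Pool the 1-edit neighborhoods of all beam members, without duplicates.
        candidates = set()
        for seq in beam:
            candidates.update(one_edit_neighbors(seq))
        # Gumbel-perturbed scores; keeping the top-k samples k candidates
        # without replacement, proportionally to exp(score / temperature).
        perturbed = [(score_fn(c) / temperature + gumbel(rng), c)
                     for c in sorted(candidates)]
        perturbed.sort(reverse=True)
        beam = [c for _, c in perturbed[:beam_width]]
    return beam


if __name__ == "__main__":
    for s in stochastic_beam_search("ACDE", toy_score, beam_width=3, n_steps=2):
        print(s, round(toy_score(s), 3))
```

Because the search only needs a whole-sequence score, additional objectives could in principle be folded in by passing a `score_fn` that sums several terms, for example a language-model score plus a property predictor. That combination is an illustration of the abstract's point about flexible guidance, not a description of the paper's exact objective.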

Paper Structure (28 sections, 6 equations, 7 figures, 1 table)


Figures (7)

  • Figure 1: Machine learning-guided iterative design of therapeutic antibodies. This work focuses on Step 3, wherein an MLM is combined with seed sequences and (optionally) predictive models, and proposes mutated sequences.
  • Figure 2: In vitro results. For success rate, we show 95% CIs via binomial test inversion.
  • Figure 3: Additional developability-related in silico evaluations of the sequences in the in vitro experiment. For each method, observations are split by successful synthesizability and binding.
  • Figure 4: Additional analysis of mutational preferences of different methods for the in vitro experiment. For each method, observations are split by successful synthesizability and binding.
  • Figure 5: In silico predicted probability of synthesizability for AbLang2 generated sequences, as a function of number of edits.
  • ...and 2 more figures