Protein Design with Guided Discrete Diffusion

Nate Gruver; Samuel Stanton; Nathan C. Frey; Tim G. J. Rudner; Isidro Hotzel; Julien Lafrance-Vanasse; Arvind Rajpal; Kyunghyun Cho; Andrew Gordon Wilson

TL;DR

This work addresses the challenge of designing protein sequences directly in sequence space by introducing diffusioN Optimized Sampling (NOS), a gradient-guided, discrete diffusion method. NOS enables controllable sampling in discrete spaces and is integrated with LaMBO-2 to perform multi-objective, edit-aware antibody design using saliency maps to select edit positions. The approach delivers improved objective-value versus likelihood trade-offs in silico and demonstrates strong experimental validation, achieving high expression and notable binding across multiple targets. Overall, NOS and LaMBO-2 offer a data-efficient, scalable framework for sequence-level protein design that can reduce reliance on costly structure-based methods and extensive screening.

Abstract

A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. The generative model samples plausible sequences while the discriminative model guides a search for sequences with high fitness. Given its broad success in conditional sampling, classifier-guided diffusion modeling is a promising foundation for protein design, leading many to develop guided diffusion models for structure with inverse folding to recover sequences. In this work, we propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models that follows gradients in the hidden states of the denoising network. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods, including scarce data and challenging inverse design. Moreover, we use NOS to generalize LaMBO, a Bayesian optimization procedure for sequence design that facilitates multiple objectives and edit-based constraints. The resulting method, LaMBO-2, enables discrete diffusions and stronger performance with limited edits through a novel application of saliency maps. We apply LaMBO-2 to a real-world protein design task, optimizing antibodies for higher expression yield and binding affinity to several therapeutic targets under locality and developability constraints, attaining a 99% expression rate and 40% binding rate in exploratory in vitro experiments.
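The hidden-state guidance described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the linear decoder `W`, the linear value function, the KL penalty that keeps the guided token distribution near the unguided one, and the names `eta`, `lam`, and `n_steps` are all assumptions made for illustration, and any noise added during guidance is omitted so the example stays deterministic.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def value(h, w):
    # Hypothetical value function: linear head on mean-pooled hidden states.
    return float(h.mean(axis=0) @ w)

def kl_to_reference(h_new, h_ref, W):
    # Summed per-position KL(p(.|h_new) || p(.|h_ref)) under a linear decoder W.
    p_new, p_ref = softmax(h_new @ W), softmax(h_ref @ W)
    return float(np.sum(p_new * (np.log(p_new) - np.log(p_ref))))

def guided_hidden_states(h, W, w, eta=0.1, lam=1.0, n_steps=5):
    """Gradient ascent on value(h') - lam * KL(p(.|h') || p(.|h)) with respect to h'."""
    log_p_ref = np.log(softmax(h @ W))
    h_new = h.copy()
    for _ in range(n_steps):
        p_new = softmax(h_new @ W)
        diff = np.log(p_new) - log_p_ref
        kl_per_pos = np.sum(p_new * diff, axis=-1, keepdims=True)
        # Analytic gradient of the KL term w.r.t. logits, pushed back through W.
        grad_kl = (p_new * (diff - kl_per_pos)) @ W.T
        # Gradient of the mean-pooled linear value function is w / seq_len per position.
        grad_value = np.broadcast_to(w / h.shape[0], h.shape)
        h_new = h_new + eta * (grad_value - lam * grad_kl)
    return h_new

def sample_tokens(h, W, rng):
    # Sample one token per position from the decoded distribution.
    probs = softmax(h @ W)
    return np.array([rng.choice(probs.shape[1], p=p) for p in probs])

rng = np.random.default_rng(0)
seq_len, hidden_dim, vocab = 6, 8, 20               # toy sizes, 20 amino acids
h = rng.normal(size=(seq_len, hidden_dim))          # stand-in for denoiser hidden states
W = 0.1 * rng.normal(size=(hidden_dim, vocab))      # stand-in for the decoder head
w = rng.normal(size=hidden_dim)                     # stand-in for the value head
h_guided = guided_hidden_states(h, W, w)
tokens = sample_tokens(h_guided, W, rng)
```

In this sketch, raising `eta` or lowering `lam` pushes harder on the value function at the cost of drifting further from the unguided distribution, which is the kind of objective-versus-likelihood trade-off the paper reports.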

Paper Structure (35 sections, 27 equations, 18 figures, 3 tables, 3 algorithms)

Introduction
Related Work
Background
Methods
NOS: diffusioN Optimized Sampling
LaMBO-2: function-guided protein design
Experiments
Unguided antibody CDR infilling
Optimizing antibodies for in silico objectives
Antibody lead optimization: in silico evaluation
Antibody lead optimization: in vitro evaluation
Discussion
Appendix
Extended Background
Continuous noise diffusion
...and 20 more sections

Figures (18)

Figure 1: We propose diffusioN Optimized Sampling (NOS), a method for gradient-guided sampling from discrete diffusion models. NOS uses $T$ sampling steps of denoising diffusion, where each step consists of applying a corruption, taking gradient steps to optimize a value function, $f$, and sampling the next discrete sequence or corresponding latent state. NOS generates samples that optimize an arbitrary objective while maintaining high likelihood with respect to a reference distribution of sequences. We combine NOS with LaMBO, a strong Bayesian optimization method for sequence design (Stanton et al., 2022), to make LaMBO-2, our improved method for protein design.
Figure 2: Two approaches to diffusion generative modeling for categorical variables. (Left) Categorical data is embedded into continuous variables with an accompanying continuous noise process. (Right) Categorical noise is applied directly to sequences, and corrupted sequences are denoised using standard language modeling methods.
Figure 3: An example of a binding affinity saliency map produced by LaMBO-2 with NOS-D. For simplicity, only the variable heavy (VH) region of the hu4D5 antibody is shown. Positions corresponding to complementarity defining regions (CDRs) are enclosed in green boxes. Converting this saliency map to an edit position distribution will concentrate computational resources on editing CDRH3, which is often manually selected by experts. Some resources are also allocated to the framework and other CDRs since these positions may also affect binding.
Figure 4: We infill antibody CDRs with discrete diffusion models (ours) and compare against structure-based diffusion models (DiffAb (Luo et al., 2022) and RFDiffusion (Watson et al., 2022)) and an autoregressive antibody language model (IgLM (Shuai et al., 2021)). We see that diffusion on sequences alone, without structural priors, reliably leads to high sequence recovery. For structure-based methods, we first fold seed sequences with IgFold (Ruffolo et al., 2022) and then run joint sampling of sequence and structure for the CDR. We sample 10 infills for each of the 10 antibody seed sequences selected randomly from paired OAS (Olsen et al., 2022).
Figure 5: Comparing samples from NOS (ours) with alternative guided generation methods and structure-based models. NOS exhibits higher likelihood for similar or dramatically improved values of the objective. (Left) Sequence diversification (resampling and selecting improved points) with DiffAb (Luo et al., 2022) or RFDiffusion (Watson et al., 2022). DiffAb generates sequences and structures simultaneously, while sequences for RFDiffusion are obtained using ProteinMPNN (Dauparas et al., 2022). Compared with NOS, these methods do not effectively optimize the objective and yield low-likelihood sequences. (Right) Guided generation using PPLM (Dathathri et al., 2019), a guidance method for autoregressive language models (in this case IgLM (Shuai et al., 2021)), and DiGress (Vignac et al., 2022), a competing guidance method for discrete diffusion models. NOS, PPLM, and DiGress are sampled for many settings of guidance strength (e.g. $\eta$ and $\lambda$ in the guidance step equation) to demonstrate the full range of trade-offs between the objective and likelihood. We provide details about hyperparameter settings and additional density plots in the Appendix.
...and 13 more figures
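The Figure 3 caption mentions converting a saliency map into an edit position distribution. A minimal sketch of one way to do that follows, assuming a temperature-scaled softmax over per-position saliency scores and sampling without replacement. The function names, the `temperature` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def edit_position_distribution(saliency, temperature=1.0):
    # Temperature-scaled softmax over per-position saliency (an assumed normalization).
    s = np.asarray(saliency, dtype=float)
    logits = (s - s.max()) / temperature  # shift by the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def sample_edit_positions(saliency, num_edits, rng, temperature=1.0):
    # Draw distinct positions so the edit budget is not spent twice on one site.
    probs = edit_position_distribution(saliency, temperature)
    chosen = rng.choice(len(probs), size=num_edits, replace=False, p=probs)
    return np.sort(chosen)

# Made-up saliency scores: positions 3-5 play the role of a high-saliency CDR.
saliency = np.array([0.1, 0.2, 0.1, 2.0, 2.5, 2.2, 0.3, 0.1])
probs = edit_position_distribution(saliency, temperature=0.5)
positions = sample_edit_positions(
    saliency, num_edits=3, rng=np.random.default_rng(0), temperature=0.5
)
```

A lower `temperature` concentrates edits on the most salient positions, while a higher one spreads some of the budget to the remaining positions, mirroring the caption's note that framework and other CDR positions still receive some resources.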
