Protein Discovery with Discrete Walk-Jump Sampling

Nathan C. Frey; Daniel Berenberg; Karina Zadorozhny; Joseph Kleinhenz; Julien Lafrance-Vanasse; Isidro Hotzel; Yan Wu; Stephen Ra; Richard Bonneau; Kyunghyun Cho; Andreas Loukas; Vladimir Gligorijevic; Saeed Saremi

TL;DR

The paper addresses the difficulty of training and sampling from discrete generative models for antibody discovery. It introduces Smoothed Discrete Sampling (SDS) and the discrete Walk-Jump Sampling (dWJS) framework, which decouples sampling (the "walk": Langevin MCMC on the smoothed, noisy data manifold, driven by a learned energy function) from denoising (the "jump": a one-step projection back to clean sequences with a neural denoiser). The approach requires only a single noise level and yields robust sampling. A Distributional Conformity Score (DCS) is proposed to quantify the biophysical validity and novelty of generated samples and to guide model optimization. Empirically, dWJS achieves high expression (≈97.5%), generates diverse, novel antibodies in silico, and delivers strong functional performance in vitro, notably a 70% binding rate in trastuzumab CDR H3 redesign, surpassing diffusion-based and language-model baselines. The work demonstrates a practical, efficient framework for ab initio antibody discovery and design, with potential applicability to other discrete biological sequences and molecular design tasks.

Abstract

We resolve difficulties in training and sampling from a discrete generative model by learning a smoothed energy function, sampling from the smoothed data manifold with Langevin Markov chain Monte Carlo (MCMC), and projecting back to the true data manifold with one-step denoising. Our Discrete Walk-Jump Sampling formalism combines the contrastive divergence training of an energy-based model and improved sample quality of a score-based model, while simplifying training and sampling by requiring only a single noise level. We evaluate the robustness of our approach on generative modeling of antibody proteins and introduce the distributional conformity score to benchmark protein generative models. By optimizing and sampling from our models for the proposed distributional conformity score, 97-100% of generated samples are successfully expressed and purified and 70% of functional designs show equal or improved binding affinity compared to known functional antibodies on the first attempt in a single round of laboratory experiments. We also report the first demonstration of long-run fast-mixing MCMC chains where diverse antibody protein classes are visited in a single MCMC chain.
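
The three steps named in the abstract (smooth, walk, jump) can be illustrated on toy data. The sketch below is not the authors' implementation: it replaces the learned energy model and the neural denoiser with the closed-form score of a Gaussian-smoothed empirical distribution over a few one-hot sequences, and it uses a plain overdamped Langevin update, which may differ from the discretization used in the paper. The function names (`one_hot`, `smoothed_score`, `walk_jump`), the toy sequences, and all hyperparameter values are assumptions made for illustration. The jump uses the standard empirical Bayes identity $\hat{x}(y) = y + \sigma^2 \nabla_y \log p(y)$.

```python
import numpy as np

def one_hot(seqs, alphabet):
    """Encode equal-length strings as an array of shape (n, L, V)."""
    index = {a: i for i, a in enumerate(alphabet)}
    x = np.zeros((len(seqs), len(seqs[0]), len(alphabet)))
    for n, seq in enumerate(seqs):
        for pos, ch in enumerate(seq):
            x[n, pos, index[ch]] = 1.0
    return x

def smoothed_score(y, data, sigma):
    """Score (gradient of log density) of the Gaussian-smoothed empirical
    distribution p(y) = mean_i N(y; x_i, sigma^2 I).
    Stand-in for the gradient of a learned energy (assumption)."""
    diff = data - y                                   # (n, L, V)
    logw = -np.sum(diff ** 2, axis=(1, 2)) / (2 * sigma ** 2)
    w = np.exp(logw - logw.max())                     # stable softmax weights
    w /= w.sum()
    return np.tensordot(w, diff, axes=1) / sigma ** 2  # (L, V)

def walk_jump(data, sigma=0.5, step=0.01, n_steps=200, rng=None):
    """Walk: overdamped Langevin MCMC on noisy y. Jump: one-step denoising."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Start from a noised training point, y = x + eps, at a single noise level.
    x0 = data[rng.integers(len(data))]
    y = x0 + sigma * rng.standard_normal(x0.shape)
    for _ in range(n_steps):                          # walk
        noise = rng.standard_normal(y.shape)
        y = y + step * smoothed_score(y, data, sigma) + np.sqrt(2 * step) * noise
    x_hat = y + sigma ** 2 * smoothed_score(y, data, sigma)  # jump
    return y, x_hat.argmax(axis=-1)                   # project to discrete tokens

alphabet = "ACDEFGWY"                                 # toy alphabet (assumption)
data = one_hot(["ACDE", "ACDF", "WYYG"], alphabet)
_, tokens = walk_jump(data)
print("".join(alphabet[t] for t in tokens))
```

Because the walk never touches clean sequences, the jump can be applied at any step of the chain without affecting later samples, which is the decoupling the abstract refers to.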

Paper Structure (40 sections, 10 equations, 5 figures, 7 tables, 4 algorithms)

Introduction
Background
Energy-based models
Neural empirical Bayes
Antibody discovery and design
Discrete walk-jump sampling
Variable-length protein sequence generation
Protein design vs. discovery
Derivation of optimal noise level for discrete sequence data
Distributional conformity score
Experiments
dWJS generates natural, novel, diverse antibodies in silico
dWJS generates natural, novel, diverse antibodies in vitro
dWJS generates functional antibody variants in vitro
Related Work
...and 25 more sections

Figures (5)

Figure 1: Selected samples from a single Markov chain Monte Carlo sampling run of discrete walk-jump sampling (our method). Protein color corresponds to different antibody germlines (classes). Samples are folded with EquiFold (Lee et al., 2022) for visualization purposes. Discrete walk-jump sampling exhibits fast mixing and explores diverse modes of the distribution in a single chain.
Figure 2: Discrete walk-jump sampling. (a) The noising and denoising process is applied to antibody proteins. (b) Discrete inputs $x$ are smoothed with isotropic Gaussian noise, $\varepsilon \sim \mathcal{N}(0,\sigma^2 I_d)$, to noisy inputs, $y=x+\varepsilon$. A discrete energy-based model (dEBM) parameterizes the energy function $f_\theta(y)$ of noisy data. Noisy data is sampled with the energy function and denoised with a separate denoising ByteNet network to clean samples, $\hat{x}_\phi(y)$. (c) The "walk" sampling steps on the noisy data manifold with Langevin MCMC are fully decoupled from the "jump" steps to clean samples. (d) The dEBM takes noisy inputs $y$, concatenates them with a 1D positional encoding, $p_{1d}$, passes them through an MLP and a 3-layer CNN, and concatenates the outputs with an embedding $z_s$ of the inputs into a hidden state, $h$. $h$ is passed through an MLP, which returns the energy $f_\theta(y)$.
Figure 3: In silico designs sampled with dWJS are compared to a reference set of validation samples. Distributions are characterized with a set of sample quality metrics. Joint density estimation is used to compute the likelihood of designs versus the validation set, and likelihoods are condensed into a distributional conformity score that characterizes the similarity of generated samples to the reference set.
Figure 4: Histogram of $\chi_{ii'}$ values for random samples from the paired Observed Antibody Space (Olsen et al., 2022) dataset.
Figure 5: Expression yield (mg) and binding affinity (pKD) of sequence designs from our method targeting the ERBB2 antigen.
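
The scoring procedure in the Figure 3 caption (joint density estimation over sample quality metrics, then condensing likelihoods into one score) can be sketched as follows. This is one plausible reading of the caption, not the paper's exact definition: a Gaussian kernel density estimate is fit on reference feature vectors, and each design is scored by the fraction of held-out validation samples whose estimated density is no higher than the design's. The function names, the fixed `bandwidth`, and the synthetic two-dimensional features are all hypothetical.

```python
import numpy as np

def kde_log_density(points, reference, bandwidth=0.5):
    """Log of a Gaussian kernel density estimate, up to an additive constant.
    points: (m, d) feature vectors to score; reference: (n, d) fit set."""
    d2 = ((points[:, None, :] - reference[None, :, :]) ** 2).sum(axis=-1)  # (m, n)
    logk = -d2 / (2 * bandwidth ** 2)
    mx = logk.max(axis=1, keepdims=True)              # log-sum-exp for stability
    return mx[:, 0] + np.log(np.exp(logk - mx).mean(axis=1))

def conformity_scores(designs, reference, validation, bandwidth=0.5):
    """For each design, the fraction of validation samples whose estimated
    density is less than or equal to the design's (assumed scoring rule)."""
    val = kde_log_density(validation, reference, bandwidth)
    des = kde_log_density(designs, reference, bandwidth)
    return (val[None, :] <= des[:, None]).mean(axis=1)

# Synthetic stand-ins for per-sequence property vectors (assumption).
rng = np.random.default_rng(0)
reference = rng.standard_normal((200, 2))
validation = rng.standard_normal((100, 2))
designs = np.array([[0.0, 0.0], [10.0, 10.0]])
print(conformity_scores(designs, reference, validation))
```

Under this reading, a design whose properties fall in a dense region of the reference distribution scores close to 1, while a design far outside it scores close to 0. Only the ranking of densities matters, so the dropped normalization constant does not affect the result.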
