A Diffusion Model to Shrink Proteins While Maintaining Their Function

Ethan Baron, Alan N. Amin, Ruben Weitzman, Debora Marks, Andrew Gordon Wilson

TL;DR

This work tackles the challenge of shrinking long protein sequences without losing function by introducing SCISOR, a discrete diffusion model that learns deletions by reversing an insertion-only forward process. SCISOR scales to large evolutionary datasets (e.g., UniRef) and provides a principled training objective with alignment-based targets, achieving likelihoods competitive with other diffusion models and state-of-the-art performance in predicting deletion effects on ProteinGym. It shrinks proteins to shorter, natural-looking sequences that better preserve structural motifs and functional sites than prior approaches. The model also supports unconditional generation of natural-like proteins, and the authors release code and weights for multiple model scales to support broad use in protein design workflows.

Abstract

Many proteins useful in modern medicine or bioengineering are challenging to make in the lab, fuse with other proteins in cells, or deliver to tissues in the body, because their sequences are too long. Shortening these sequences typically involves costly, time-consuming experimental campaigns. Ideally, we could instead use modern models of massive databases of sequences from nature to learn how to propose shrunken proteins that resemble sequences found in nature. Unfortunately, these models struggle to efficiently search the combinatorial space of all deletions, and are not trained with inductive biases to learn how to delete. To address this gap, we propose SCISOR, a novel discrete diffusion model that deletes letters from sequences to generate protein samples that resemble those found in nature. To do so, SCISOR trains a de-noiser to reverse a forward noising process that adds random insertions to natural sequences. As a generative model, SCISOR fits evolutionary sequence data competitively with previous large models. In evaluation, SCISOR achieves state-of-the-art predictions of the functional effects of deletions on ProteinGym. Finally, we use the SCISOR de-noiser to shrink long protein sequences, and show that its suggested deletions result in significantly more realistic proteins and more often preserve functional motifs than previous models of evolutionary sequences.
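
The forward noising process described above can be illustrated with a minimal sketch. It assumes insertions land at uniformly random positions and that inserted letters are drawn uniformly from the 20 standard amino acids; the actual letter distribution and insertion schedule used by SCISOR are not specified in this summary. The names `add_random_insertions`, `is_subsequence`, and `AMINO_ACIDS` are hypothetical and chosen only for illustration.

```python
import random

# Assumption: the 20 standard amino acids, sampled uniformly.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def add_random_insertions(seq, num_insertions, rng=None):
    """Return `seq` with `num_insertions` random letters inserted.

    Each insertion picks a uniformly random gap (including both ends)
    and a uniformly random amino acid. This is an illustrative stand-in
    for an insertion-only forward process, not the authors' code.
    """
    rng = rng or random.Random()
    letters = list(seq)
    for _ in range(num_insertions):
        pos = rng.randint(0, len(letters))  # inclusive on both ends
        letters.insert(pos, rng.choice(AMINO_ACIDS))
    return "".join(letters)


def is_subsequence(short, long):
    """Check that `short` can be recovered from `long` by deletions only."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


original = "MKTAYIAK"
noised = add_random_insertions(original, 5, random.Random(0))
```

Because the forward process only inserts, the original sequence is always a subsequence of the noised one, which is why the reverse process can be framed purely as choosing letters to delete.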

Paper Structure

This paper contains 69 sections, 12 theorems, 47 equations, 14 figures, 9 tables, 4 algorithms.

Key Result

Theorem 4.1

(Proof in the appendix.) Let $X_0$ be a sequence of length $L$, and let $q(\cdot\mid L)$ be the distribution over sequences of length $L$ that samples each letter independently from $\mathrm{Cat}(\pi)$. Then, as the number of insertions increases, $M_1\to\infty$, $X_1$ becomes easier to

Figures (14)

  • Figure 1: SCISOR is a diffusion model trained to make deletions that arrive at a natural protein sequence. We can use SCISOR to shrink proteins while maintaining their function. (a) We add random insertions to protein sequences from nature and train SCISOR to reverse these insertions. (b) Applying SCISOR diffusion to natural proteins, we get smaller proteins that are predicted to preserve parts of the tertiary structures of the original sequence. We show SCISOR samples of Q8NFU3 at 0, 5, 10, 20, and 50% deletion with structures predicted by OmegaFold [Wu2022-ma].
  • Figure 2: To calculate our target distribution of what letter to delete, $p(\mathrm{prev}(X_t)\mid X_0, X_t, M_t)$, we align our starting sequence $X_0$ to our noised sequence $X_t$. The reverse process should favor deleting letters that are gaps in more of the alignments.
  • Figure 3: The SCISOR de-noiser $q_\theta$ plans deletions to arrive at sequences that resemble those in nature, and therefore avoids deleting important structural motifs in natural sequences. (a) SCISOR unconditionally samples proteins by starting with a large random sequence $X_1$ (light blue) and iteratively deleting according to $q_\theta(\mathrm{prev}(X)\mid X, M)$, to arrive at a protein that resembles those in nature (dark blue). We predict the structure of each sequence with OmegaFold [Wu2022-ma]. (b) We ask SCISOR to plan the first of $M$ mutations for R4SNK4 and color residue $i$ on a structure from [Aleku2016-jw] by the deletion probability $q_\theta(X^{(-i)}\mid X, M)$. As $M$ increases, SCISOR favors deletions (red) in more regions while minimizing deletions in the catalytic structural motif near the bottom (white).
  • Figure 4: SCISOR fits the distribution of sequences in nature competitively with established sequence modeling approaches. (a) SCISOR is competitive with other diffusion models (grey) in perplexity. "S, M, L" refer to model size. (b, c) Samples from SCISOR ($K=5$) are predicted to be of quality competitive with samples from diffusion models and AR models, as measured by (b) how closely they match the distribution of natural sequences (Fréchet protein distance, FPD) and (c) foldability (higher pLDDT from OmegaFold [Wu2022-ma]). We took EvoDiff and AR perplexities from [Alamdari2023-nj].
  • Figure 5: SCISOR makes state-of-the-art predictions for the effect of deletions on protein function measured in the lab. We calculate the average Spearman correlation between predicted deletion effects and measurements across all assays in ProteinGym, presenting the results from the highest-performing variant of each model architecture. Models that use multiple sequence alignment information are striped.
  • ...and 9 more figures
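
The alignment-based target described in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes every way of embedding $X_0$ in $X_t$ as a subsequence counts as one equally weighted alignment, and it normalizes per-position gap counts into a distribution over which position to delete. The paper's actual weighting of alignments and its handling of deletions that produce identical sequences may differ. The function name `deletion_targets` is hypothetical.

```python
def deletion_targets(x0, xt):
    """Probability of deleting each position of `xt`, from alignments to `x0`.

    An alignment is an embedding of x0 into xt as a subsequence; positions
    of xt left unmatched are gaps (insertions). Each position is scored by
    the number of alignments in which it is a gap, then scores are
    normalized. Uniform weighting of alignments is an assumption.
    """
    n, m = len(x0), len(xt)
    # pre[i][j]: number of ways to embed x0[:i] into xt[:j]
    pre = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        pre[0][j] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pre[i][j] = pre[i][j - 1]
            if x0[i - 1] == xt[j - 1]:
                pre[i][j] += pre[i - 1][j - 1]
    # suf[i][j]: number of ways to embed x0[i:] into xt[j:]
    suf = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        suf[n][j] = 1
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            suf[i][j] = suf[i][j + 1]
            if x0[i] == xt[j]:
                suf[i][j] += suf[i + 1][j + 1]

    total = pre[n][m]
    if total == 0 or m == n:
        raise ValueError("xt must contain x0 as a strict subsequence")

    gap_counts = []
    for j in range(m):
        # Alignments in which xt[j] is matched to some letter x0[i]
        matched = sum(
            pre[i][j] * suf[i + 1][j + 1] for i in range(n) if x0[i] == xt[j]
        )
        gap_counts.append(total - matched)
    norm = sum(gap_counts)
    return [count / norm for count in gap_counts]


targets = deletion_targets("AB", "AAB")
```

In this example either `A` in `AAB` could be the inserted letter while `B` must be matched, so the two `A` positions split the deletion probability and `B` receives none.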

Theorems & Definitions (20)

  • Theorem 4.1
  • Theorem 4.2
  • Proposition 4.3
  • Proposition A.1
  • Theorem F.1
  • proof
  • Corollary F.2
  • proof
  • Theorem F.3
  • proof
  • ...and 10 more