Importance Weighted Expectation-Maximization for Protein Sequence Design

Zhenqiao Song, Lei Li

TL;DR

IsEM-Pro introduces a structure-enhanced latent generative framework for protein sequence design, combining a latent Transformer-based generator with Markov random field constraints and learning via importance-weighted Monte Carlo EM. By sampling in latent space and leveraging combinatorial structure, it achieves higher fitness, diversity, and novelty across eight benchmarks compared with strong baselines, including CMA-ES, DbAS, CbAS, and GFlowNet variants. The approach includes theoretical and empirical analysis showing close alignment between the proposal $Q_{\phi}(x)$ and the posterior $P_{\theta}(x|\mathcal{S})$, and demonstrates case studies where designed proteins exhibit plausible folding and structural similarity to known proteins. Overall, IsEM-Pro provides a scalable, principled framework for efficient, diverse protein sequence design with practical implications for accelerated wet-lab discovery.

Abstract

Designing protein sequences with desired biological function is crucial in biology and chemistry. Recent machine learning methods use a surrogate sequence-function model to replace expensive wet-lab validation. How can we efficiently generate diverse and novel protein sequences with high fitness? In this paper, we propose IsEM-Pro, an approach to generate protein sequences towards a given fitness criterion. At its core, IsEM-Pro is a latent generative model, augmented by combinatorial structure features from separately learned Markov random fields (MRFs). We develop a Monte Carlo Expectation-Maximization (MCEM) method to learn the model. During inference, sampling from its latent space enhances diversity, while its MRF features guide exploration toward high-fitness regions. Experiments on eight protein sequence design tasks show that IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score and generates more diverse and novel protein sequences.
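To make the idea of importance-weighted Monte Carlo EM concrete, the following is a minimal toy sketch, not the IsEM-Pro model itself: it has no latent Transformer and no MRF features. Everything in it is an assumption made for illustration, including the factorized categorical proposal `phi`, the match-counting `toy_fitness`, the hidden `TARGET` sequence, and the `temperature` that turns fitness into an unnormalized target density. Each iteration samples sequences from the proposal, reweights them by target density over proposal density, and refits the proposal by weighted maximum likelihood.

```python
import numpy as np

ALPHABET = 4   # toy alphabet size (real proteins use 20 amino acids)
LENGTH = 8     # toy sequence length
TARGET = np.array([0, 1, 2, 3, 0, 1, 2, 3])  # hypothetical optimum, illustration only


def toy_fitness(seqs):
    """Hypothetical fitness: number of positions that match TARGET."""
    return (seqs == TARGET).sum(axis=1).astype(float)


def sample(phi, n, rng):
    """Draw n sequences from a factorized categorical proposal phi (LENGTH x ALPHABET)."""
    cols = [rng.choice(ALPHABET, size=n, p=phi[pos]) for pos in range(LENGTH)]
    return np.stack(cols, axis=1)


def log_q(phi, seqs):
    """Log-probability of each sequence under the factorized proposal."""
    return np.log(phi[np.arange(LENGTH), seqs]).sum(axis=1)


def mcem(n_iters=30, n_samples=500, temperature=0.5, smoothing=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    phi = np.full((LENGTH, ALPHABET), 1.0 / ALPHABET)  # start from a uniform proposal
    history = []
    for _ in range(n_iters):
        # Monte Carlo E-step: sample from the proposal and score the samples.
        seqs = sample(phi, n_samples, rng)
        fit = toy_fitness(seqs)
        history.append(float(fit.mean()))
        # Self-normalized importance weights: target density (assumed
        # proportional to exp(fitness / temperature)) over proposal density.
        log_w = fit / temperature - log_q(phi, seqs)
        log_w -= log_w.max()  # stabilize the exponentiation
        w = np.exp(log_w)
        w /= w.sum()
        # M-step: weighted maximum likelihood for each position, with smoothing
        # so that no symbol's probability collapses to exactly zero.
        counts = np.full((LENGTH, ALPHABET), smoothing)
        for pos in range(LENGTH):
            counts[pos] += np.bincount(seqs[:, pos], weights=w, minlength=ALPHABET)
        phi = counts / counts.sum(axis=1, keepdims=True)
    return phi, history


phi, history = mcem()
print("mean fitness, first vs. last iteration:", history[0], history[-1])
```

In this toy setting the target factorizes across positions, so the proposal can match it almost exactly; the paper's setting replaces the factorized proposal with a learned latent generative model.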

Paper Structure (37 sections, 1 theorem, 24 equations, 6 figures, 7 tables, 1 algorithm)

This paper contains 37 sections, 1 theorem, 24 equations, 6 figures, 7 tables, 1 algorithm.

Key Result

Lemma 3.1

If the KL divergence between two distributions $P$ and $Q$ is less than a small positive value $\delta$, then the difference between the probabilities that $P$ and $Q$ assign to any sample is bounded by $\sqrt{2\delta}$.
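The bound in Lemma 3.1 has the same form as Pinsker's inequality, $\lVert P - Q \rVert_1 \le \sqrt{2\,\mathrm{KL}(P \,\|\, Q)}$, which implies that $|P(x) - Q(x)| \le \sqrt{2\delta}$ for every $x$ whenever the KL divergence is at most $\delta$. Whether the paper's proof follows this route is not stated here. The sketch below checks the inequality numerically on random discrete distributions; the direction $\mathrm{KL}(P \,\|\, Q)$, the support size of 20, and the Dirichlet sampling are assumptions chosen for illustration.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(P || Q) in nats for discrete distributions with full support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


def max_pointwise_gap(p, q):
    """Largest |P(x) - Q(x)| over all outcomes x."""
    return float(np.max(np.abs(np.asarray(p) - np.asarray(q))))


def check_bound(p, q):
    """Return the largest pointwise gap and the sqrt(2 * KL) bound."""
    delta = kl_divergence(p, q)
    return max_pointwise_gap(p, q), float(np.sqrt(2.0 * delta))


rng = np.random.default_rng(0)
results = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(20))  # random distribution over 20 outcomes
    q = rng.dirichlet(np.ones(20))
    results.append(check_bound(p, q))

all_hold = all(gap <= bound + 1e-12 for gap, bound in results)
print("bound holds on all random pairs:", all_hold)
```

The check uses $\delta = \mathrm{KL}(P \,\|\, Q)$ itself, which is the tightest value the lemma's hypothesis allows, so any larger $\delta$ also satisfies the bound.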

Figures (6)

  • Figure 1: Protein fitness landscape (the distribution of a functional property across proteins). A protein may exhibit a single-peaked fitness landscape (Fujiyama landscape, (a)) or a multi-peaked landscape (Badlands landscape, (b)) [kauffman1989nk]. In a Fujiyama landscape, any method could perform well. In the rougher Badlands landscape, however, previous methods get trapped in a worse local optimum, while the proposed IsEM-Pro can climb much closer to the global optimum through iterative sampling in the latent space.
  • Figure 2: Workflow of traditional protein sequence design. We aim to accelerate this process by directly generating desirable sequences.
  • Figure 3: Overall architecture of the proposed IsEM-Pro. The upper half illustrates the Markov random field, which learns the combinatorial structure of amino acids in protein sequences from the same family. The bottom half shows the probabilistic model augmented with combinatorial structure features. Red lines show the calculation of the importance weights.
  • Figure 4: Approximate KL divergence on eight protein datasets. The variance of the KL divergence is very small across sample sizes for all datasets, giving empirical evidence that sampling from the final $Q_{\phi}(x)$ differs only slightly from sampling from the posterior distribution $P_{\theta}(x|\mathcal{S})$.
  • Figure 5: 3-D visualization of our designed green fluorescent protein, validating that IsEM-Pro can generate a realistic fluorescent protein.
  • ...and 1 more figures

Theorems & Definitions (2)

  • Lemma 3.1
  • proof