Importance Weighted Expectation-Maximization for Protein Sequence Design
Zhenqiao Song, Lei Li
TL;DR
IsEM-Pro introduces a structure-enhanced latent generative framework for protein sequence design, combining a latent Transformer-based generator with Markov random field constraints and learning via importance-weighted Monte Carlo EM. By sampling in latent space and leveraging combinatorial structure, it achieves higher fitness, diversity, and novelty across eight benchmarks compared with strong baselines, including CMA-ES, DbAS, CbAS, and GFlowNet variants. The approach includes theoretical and empirical analysis showing close alignment between the proposal $Q_{\phi}(x)$ and the posterior $P_{\theta}(x|\mathcal{S})$, and demonstrates case studies where designed proteins exhibit plausible folding and structural similarity to known proteins. Overall, IsEM-Pro provides a scalable, principled framework for efficient, diverse protein sequence design with practical implications for accelerated wet-lab discovery.
Abstract
Designing protein sequences with a desired biological function is crucial in biology and chemistry. Recent machine learning methods use a surrogate sequence-function model to replace expensive wet-lab validation. How can we efficiently generate diverse and novel protein sequences with high fitness? In this paper, we propose IsEM-Pro, an approach to generate protein sequences towards a given fitness criterion. At its core, IsEM-Pro is a latent generative model, augmented by combinatorial structure features from separately learned Markov random fields (MRFs). We develop a Monte Carlo Expectation-Maximization (MCEM) method to learn the model. During inference, sampling from its latent space enhances diversity, while its MRF features guide exploration in high-fitness regions. Experiments on eight protein sequence design tasks show that IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score and generates more diverse and novel protein sequences.
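To make the learning idea concrete, the following is a minimal toy sketch of importance-weighted Monte Carlo EM, not the authors' implementation. It makes several simplifying assumptions that are not in the paper: the latent Transformer generator is replaced by a position-wise categorical distribution, the MRF features are omitted, the alphabet `ACDE` has only four letters, the prior over sequences is uniform, and the surrogate `fitness` (fraction of positions matching a target string) and the `temperature` and `smoothing` parameters are hypothetical. Each iteration samples from the current proposal, weights every sample by the ratio of an unnormalised target density to the proposal density, and refits the proposal by weighted maximum likelihood.

```python
import math
import random

ALPHABET = "ACDE"  # hypothetical 4-letter alphabet, kept tiny for illustration


def fitness(seq, target):
    """Hypothetical surrogate fitness: fraction of positions matching `target`."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def sample(probs, rng):
    """Draw one sequence from a position-wise categorical proposal."""
    return "".join(rng.choices(ALPHABET, weights=p)[0] for p in probs)


def log_prob(seq, probs):
    """Log-probability of `seq` under the position-wise proposal."""
    return sum(math.log(p[ALPHABET.index(c)]) for c, p in zip(seq, probs))


def iw_mcem(target, n_iters=15, n_samples=2000, temperature=0.1,
            smoothing=1e-3, seed=0):
    rng = random.Random(seed)
    k = len(ALPHABET)
    probs = [[1.0 / k] * k for _ in target]   # start from a uniform proposal
    log_prior = -len(target) * math.log(k)    # assumed uniform prior
    for _ in range(n_iters):
        # Monte Carlo E-step: sample from the current proposal and compute
        # log importance weights: log(target density) - log(proposal density).
        seqs = [sample(probs, rng) for _ in range(n_samples)]
        log_w = [fitness(s, target) / temperature + log_prior - log_prob(s, probs)
                 for s in seqs]
        # Self-normalise the weights in a numerically stable way.
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        total = sum(w)
        # M-step: importance-weighted maximum likelihood per position,
        # with a small smoothing term so no probability collapses to zero.
        counts = [[smoothing] * k for _ in target]
        for s, wi in zip(seqs, w):
            for i, c in enumerate(s):
                counts[i][ALPHABET.index(c)] += wi / total
        probs = [[c / sum(row) for c in row] for row in counts]
    return probs
```

Calling `iw_mcem("ACDEACDE")` returns one probability row per position. In this toy setting the temperature controls how sharply the fitted proposal concentrates on high-fitness sequences: a lower value favors fitness, a higher value keeps more diversity.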
