Bridge-IF: Learning Inverse Protein Folding with Markov Bridges

Yiheng Zhu, Jialu Wu, Qiuyi Li, Jiahuan Yan, Mingze Yin, Wei Wu, Mingyang Li, Jieping Ye, Zheng Wang, Jian Wu

TL;DR

This work proposes Bridge-IF, a generative diffusion bridge model for inverse folding, which is designed to learn the probabilistic dependency between the distributions of backbone structures and protein sequences, and introduces a reparameterization perspective on Markov bridge models.

Abstract

Inverse protein folding is a fundamental task in computational protein design, which aims to design protein sequences that fold into the desired backbone structures. While the development of machine learning algorithms for this task has seen significant success, the prevailing approaches, which predominantly employ a discriminative formulation, frequently encounter the error accumulation issue and often fail to capture the extensive variety of plausible sequences. To fill these gaps, we propose Bridge-IF, a generative diffusion bridge model for inverse folding, which is designed to learn the probabilistic dependency between the distributions of backbone structures and protein sequences. Specifically, we harness an expressive structure encoder to propose a discrete, informative prior derived from structures, and establish a Markov bridge to connect this prior with native sequences. During the inference stage, Bridge-IF progressively refines the prior sequence, culminating in a more plausible design. Moreover, we introduce a reparameterization perspective on Markov bridge models, from which we derive a simplified loss function that facilitates more effective training. We also modulate protein language models (PLMs) with structural conditions to precisely approximate the Markov bridge process, thereby significantly enhancing generation performance while maintaining parameter-efficient training. Extensive experiments on well-established benchmarks demonstrate that Bridge-IF predominantly surpasses existing baselines in sequence recovery and excels in the design of plausible proteins with high foldability. The code is available at https://github.com/violet-sto/Bridge-IF.
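The refinement idea in the abstract, starting from a structure-derived prior sequence and moving toward the native sequence, can be illustrated with a toy discrete bridge. The sketch below is not the authors' implementation. The function names, the linear schedule, and the transition rule are all assumptions made for illustration: each position independently keeps its prior residue with the cumulative probability $\prod_{s<t}\beta_s$ and otherwise jumps to the native residue.

```python
import numpy as np


def cumulative_keep_prob(betas, t):
    """Probability that a position still holds its prior residue after t steps.

    Assumed toy schedule: betas[s] is the per-step probability of staying put.
    The empty product at t = 0 is 1.0, so the bridge starts exactly at the prior.
    """
    return float(np.prod(betas[:t]))


def sample_bridge_state(prior, native, betas, t, rng):
    """Sample an intermediate sequence z_t between prior (t = 0) and native (t = T).

    Hypothetical helper: each position is treated independently and either
    keeps the prior residue or has already jumped to the native residue.
    """
    if len(prior) != len(native):
        raise ValueError("prior and native sequences must have the same length")
    keep = rng.random(len(prior)) < cumulative_keep_prob(betas, t)
    return "".join(p if k else n for p, n, k in zip(prior, native, keep))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 10
    # Assumed schedule: the last beta is 0, so every position reaches the native residue at t = T.
    betas = np.linspace(0.9, 0.0, T)
    prior, native = "ACDEFGHIKL", "MCDEYGHIKV"
    for t in (0, 3, 6, T):
        print(t, sample_bridge_state(prior, native, betas, t, rng))
```

In training, intermediate states sampled this way would serve as inputs from which a network predicts the native sequence. At inference the native sequence is unknown, so the model's own prediction would take its place at each refinement step.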

Paper Structure

This paper contains 34 sections, 1 theorem, 15 equations, 4 figures, 4 tables, 2 algorithms.

Key Result

Proposition 4.1

The loss objective $\mathcal{L}_t(\theta)$ for sequence $x$ at the $t$-th step can be reduced to the form where $\lambda_t = 1 - \beta_t$.

Figures (4)

  • Figure 1: Overview of Bridge-IF. Bridge-IF consists of an expressive structure encoder supervised by native sequences for proposing a discrete, deterministic prior, and a Markov bridge model for learning the dependency between the distribution of prior sequences and the distribution of native sequences. During the inference stage, Bridge-IF progressively refines the prior sequence.
  • Figure 2: Model architecture of Bridge-IF.
  • Figure 3: Performance comparison w.r.t. model scales of PLMs using the ESM-2 series on CATH 4.3.
  • Figure 4: Folding comparison of our designed sequences (in blue) and the native sequences (in nude).

Theorems & Definitions (1)

  • Proposition 4.1