Refold: Refining Protein Inverse Folding with Efficient Structural Matching and Fusion

Yiran Zhu, Changxi Chi, Hongxin Xiang, Wenjie Du, Xiaoqi Wang, Jun Xia

Abstract

Protein inverse folding aims to design an amino acid sequence that folds into a given backbone structure and is a central task in protein design. Two main paradigms have been widely explored. Template-based methods exploit database-derived structural priors and can achieve high local precision when close structural neighbors are available, but their dependence on database coverage and match quality often degrades performance on out-of-distribution (OOD) targets. Deep learning approaches, in contrast, learn general structure-to-sequence regularities and usually generalize better to new backbones. However, they struggle to capture fine-grained local structure, which can lead to uncertain residue predictions and missed local motifs in ambiguous regions. We introduce Refold, a framework that combines database-derived structural priors with deep learning predictions to improve inverse folding. Refold obtains structural priors from matched neighbors and fuses them with model predictions to refine residue probabilities. In practice, low-quality neighbors can introduce noise and degrade performance. We address this with a Dynamic Utility Gate that controls prior injection and falls back to the base prediction when the priors are untrustworthy. Evaluations on standard benchmarks show that Refold achieves state-of-the-art native sequence recovery of 0.63 on both CATH 4.2 and CATH 4.3. Further analysis indicates that Refold delivers larger gains in high-uncertainty regions, reflecting the complementarity between structural priors and deep learning predictions.
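
For readers unfamiliar with the metric, native sequence recovery for a single protein is the fraction of positions at which the designed residue matches the native one. The sketch below illustrates that definition only. The function name is hypothetical, and how per-protein values are aggregated into the reported benchmark number is not specified in this summary.

```python
def sequence_recovery(predicted: str, native: str) -> float:
    """Fraction of positions where the predicted residue equals the native residue."""
    if len(predicted) != len(native):
        raise ValueError("predicted and native sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    # Count position-wise matches and normalize by sequence length.
    matches = sum(p == n for p, n in zip(predicted, native))
    return matches / len(native)


if __name__ == "__main__":
    # Toy sequences, not taken from any benchmark.
    print(sequence_recovery("ACDE", "ACDF"))
```

A recovery of 0.63 on this scale would mean that roughly 63 of every 100 designed positions match the native residue.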

Paper Structure (24 sections, 8 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Template-based methods use database-derived structural priors to achieve high local precision but often show poor OOD generalization. Deep learning approaches generalize better to new backbones yet struggle with fine-grained local structure. Refold integrates these complementary strengths to improve both OOD generalization and local precision.
  • Figure 2: Overview of Refold. (a) Structural matching retrieves $K$ matched neighbors for the target backbone $\mathcal{S}$. (b) These neighbors are organized into a Stacked Neighbor Alignment matrix $\mathbf{A}$, where row 0 serves as a base anchor derived from the base logits, $\operatorname{argmax}(z_{\text{base}})$. A Similarity-Weighted Fusion Module distills $\mathbf{A}$ into reference logits $z_{\text{ref}}$ and linearly fuses them with $z_{\text{base}}$ to form $p_{\text{fused}}$. (c) To mitigate the adverse impact of low-quality neighbors, a Dynamic Utility Gate guided by the Inference-Time Statistics modulates the fusion, falling back to the base prediction when structural priors are unreliable.
  • Figure 3: Schematic of the Similarity-Weighted Fusion Module. From the base logits $z_{\text{base}}$, we construct a Stacked Neighbor Alignment matrix $\mathbf{A}$: row 0 serves as the base anchor $\operatorname{argmax}\!\left(z_{\text{base}}\right)$, while rows $1,\dots,K$ contain aligned matched neighbor tokens. The module embeds $\mathbf{A}$, applies row-wise smoothing to mitigate local noise, and aggregates structural priors via attention weighted by the reliability bias $\beta$. This process produces reference logits $z_{\text{ref}}$, which are then residually fused with the base logits to yield $z_{\text{fused}} = z_{\text{base}} + \lambda \cdot z_{\text{ref}}$.
  • Figure 4: Site-wise recovery states (sampled proteins). We sample $N$ proteins and visualize the first 50 residues. Each cell denotes a residue-level transition from the Base model to Refold: Neg (wrong$\rightarrow$wrong), Pos (correct$\rightarrow$correct), Neg$\rightarrow$Pos (wrong$\rightarrow$correct, correction), and Pos$\rightarrow$Neg (correct$\rightarrow$wrong, error). Corrections (Neg$\rightarrow$Pos) tend to appear in localized segments, while error events (Pos$\rightarrow$Neg) are sparse, suggesting that Refold performs targeted, structure-aligned refinements rather than indiscriminate perturbations.
  • Figure 5: Sensitivity analysis of the number of matched neighbors $K$. The performance gain saturates around $K=10$, indicating that the model aggregates structural priors effectively without requiring a large number of neighbors.
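
The fusion and gating described in the Figure 2 and Figure 3 captions can be illustrated with a toy sketch. It is not the authors' implementation. The learned embedding, row-wise smoothing, and attention are replaced here by a plain similarity-weighted vote, and the Dynamic Utility Gate is replaced by a hard threshold on the best neighbor similarity. The names `fuse`, `min_sim`, and `anchor_score` and all default values are assumptions made for illustration. Only the residual form $z_{\text{fused}} = z_{\text{base}} + \lambda \cdot z_{\text{ref}}$ and the fallback to the base prediction come from the captions.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decode(rows):
    """Argmax token per position (works for logits or probabilities)."""
    return "".join(ALPHABET[max(range(len(r)), key=r.__getitem__)] for r in rows)


def reference_logits(A, row_weights):
    """Similarity-weighted vote per position; a stand-in for the learned module."""
    z_ref = [[0.0] * len(ALPHABET) for _ in range(len(A[0]))]
    for row, w in zip(A, row_weights):
        for i, tok in enumerate(row):
            idx = ALPHABET.find(tok)
            if idx >= 0:  # skip gaps or unknown symbols
                z_ref[i][idx] += w
    return z_ref


def fuse(z_base, neighbors, sims, lam=1.0, min_sim=0.5, anchor_score=1.0):
    """Return per-position probabilities after gated residual fusion."""
    # Hypothetical hard gate: fall back to the base prediction when no
    # neighbor is similar enough to be trusted.
    if not neighbors or max(sims) < min_sim:
        return [softmax(row) for row in z_base]
    # Stacked alignment: row 0 is the base anchor, rows 1..K are neighbors.
    A = [decode(z_base)] + list(neighbors)
    weights = softmax([anchor_score] + list(sims))
    z_ref = reference_logits(A, weights)
    # Residual fusion: z_fused = z_base + lam * z_ref.
    z_fused = [[b + lam * r for b, r in zip(b_row, r_row)]
               for b_row, r_row in zip(z_base, z_ref)]
    return [softmax(row) for row in z_fused]


if __name__ == "__main__":
    # Two positions: the first is confident in 'A', the second is ambiguous.
    z = [[-1.0] * 20 for _ in range(2)]
    z[0][0] = 5.0
    z[1][0], z[1][1] = 0.1, 0.0
    print(decode(fuse(z, ["AC", "AC"], [0.9, 0.8], lam=2.0)))  # trusted neighbors
    print(decode(fuse(z, ["AC", "AC"], [0.1, 0.2], lam=2.0)))  # gate closed
```

In this toy setup the neighbors only change the ambiguous second position, and only when their similarity clears the threshold. That mirrors the qualitative behavior the captions describe: targeted refinement where the base model is uncertain, with fallback when priors are unreliable.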