RadDiff: Retrieval-Augmented Denoising Diffusion for Protein Inverse Folding

Jin Han, Tianfan Fu, Wu-Jun Li

TL;DR

RadDiff addresses protein inverse folding by integrating external evolutionary knowledge through retrieval of structurally similar proteins into a diffusion-based sequence design framework. It constructs a residue-level amino acid profile from residue-wise alignments of retrieved structures and fuses this profile with 3D structure representations via a lightweight integration module, further enhanced by a masked-prior denoising strategy. Empirically, RadDiff achieves state-of-the-art performance on CATH, PDB, and TS50, improving sequence recovery by up to 19% while maintaining parameter efficiency and scalable retrieval as database size grows. This retrieval-augmented diffusion approach enables flexible, up-to-date utilization of growing protein databases without relying on massive PLMs, offering a practical path for evolutionary-informed protein design.

Abstract

Protein inverse folding, the design of an amino acid sequence based on a target 3D structure, is a fundamental problem in computational protein engineering. Existing methods either generate sequences without leveraging external knowledge or rely on protein language models (PLMs). The former omit the evolutionary information stored in protein databases, while the latter are parameter-inefficient and cannot readily adapt to ever-growing protein data. To overcome these drawbacks, we propose a novel method, retrieval-augmented denoising diffusion (RadDiff), for protein inverse folding. Given the target protein backbone, RadDiff uses a hierarchical search strategy to efficiently retrieve structurally similar proteins from large protein databases. The retrieved structures are then aligned residue-by-residue to the target to construct a position-specific amino acid profile, which serves as an evolutionary-informed prior that conditions the denoising process. A lightweight integration module is further designed to incorporate this prior effectively. Experimental results on the CATH, PDB, and TS50 datasets show that RadDiff consistently outperforms existing methods, improving sequence recovery rate by up to 19%. The results also demonstrate that RadDiff generates highly foldable sequences and scales effectively with database size.
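To make the profile-construction step in the abstract concrete, here is a minimal sketch. It is not the authors' implementation. It assumes each retrieved protein has already been aligned to the target and is given as a string with one character per target position, with `-` marking positions that have no aligned residue. The function name `build_profile`, the `pseudocount` parameter, and the uniform fallback for uncovered positions are hypothetical choices made for illustration; the paper may weight retrieved proteins differently.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_profile(aligned_seqs, target_len, pseudocount=0.0):
    """Return per-position amino acid probabilities as a list of dicts.

    aligned_seqs: strings of length target_len; '-' means no aligned residue.
    Hypothetical helper for illustration only.
    """
    profile = []
    for pos in range(target_len):
        # Count the residue types that the retrieved proteins place at this position.
        counts = Counter(
            seq[pos] for seq in aligned_seqs if seq[pos] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        if total == 0:
            # Assumption: with no coverage, fall back to a uniform distribution.
            profile.append({aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS})
        else:
            profile.append(
                {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
            )
    return profile


# Three retrieved proteins aligned to a target of length 3;
# the last target position has no aligned residue.
profile = build_profile(["AC-", "AG-", "AC-"], target_len=3)
```

Each entry of `profile` sums to 1, so it can be read as the position-specific prior that the abstract says conditions the denoising process.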

Paper Structure

This paper contains 37 sections, 15 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: RadDiff's pipeline. (1) Hierarchical search: retrieve a set of structurally similar proteins $\mathcal{R}_q$ from the database $\mathcal{D}_{db}$. (2) Residue-level alignment: superimpose the retrieved proteins onto the query structure using US-align, and use the aligned residues as references for the amino acid types in the original sequence. (3) Generating the amino acid profile: the profile consists of position-specific amino acid probabilities, which serve as an evolutionary-informed prior that directly conditions the denoising process. Red denotes an incorrect amino acid type after alignment, while the other color denotes a correct type.
  • Figure 2: Illustration of strategy to prevent data leakage.
  • Figure 3: The relationship between the size of external database, hit numbers, and sequence recovery.
  • Figure 4: The relationship between the average TM-score of test proteins and their retrieved proteins, and sequence recovery.
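
Figures 3 and 4 both report sequence recovery. As commonly defined, it is the fraction of positions at which the designed sequence matches the native one. The sketch below assumes two equal-length sequences; the function name `sequence_recovery` is hypothetical and not taken from the paper.

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        return 0.0  # assumption: define recovery of an empty sequence as 0
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Three of four positions match.
recovery = sequence_recovery("ACDE", "ACDF")  # 0.75
```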