Diffusion Model with Representation Alignment for Protein Inverse Folding
Chenglin Wang, Yucheng Zhou, Zijie Zhai, Jianbing Shen, Kai Zhang
TL;DR
Protein inverse folding aims to recover the amino acid sequence from a given backbone structure. We introduce DMRA, a diffusion-based approach with a Shared Center that aggregates global context and a Representation Alignment mechanism that grounds denoising in predefined amino acid semantics, so that each residue's representation is aligned to its correct type. The model combines a structured protein graph, a forward diffusion process over the 20 amino acid types, and a denoising network built from message passing, the Shared Center, and semantic alignment, trained with a combined loss. On CATH4.2 and the generalization benchmarks TS50 and TS500, DMRA achieves state-of-the-art recovery and perplexity while remaining competitive with methods that rely on external knowledge. Ablations confirm that both the Shared Center and Representation Alignment are critical to performance and interpretability.
Abstract
Protein inverse folding is a fundamental problem in bioinformatics: recovering the amino acid sequence from a given protein backbone structure. Existing methods have been successful, but they struggle to fully capture the intricate inter-residue relationships that are critical for accurate sequence prediction. We propose diffusion models with representation alignment (DMRA), which enhances diffusion-based inverse folding in two ways: (1) a shared center that aggregates contextual information from the entire protein structure and selectively distributes it to each residue; and (2) alignment of noisy hidden representations with clean semantic representations during the denoising process. The alignment relies on predefined semantic representations for the amino acid types and on a representation alignment method that uses type embeddings as semantic feedback to normalize each residue. Extensive evaluations on the CATH4.2 dataset show that DMRA outperforms leading methods and achieves state-of-the-art performance, and it generalizes strongly to the TS50 and TS500 datasets.
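The two components named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the exact operators, so mean pooling for aggregation, a sigmoid gate for selective distribution, and a cosine-based alignment loss are all assumptions, and the names `shared_center_update`, `alignment_loss`, `w_gate`, and `type_emb` are hypothetical.

```python
import numpy as np

NUM_TYPES = 20  # the 20 standard amino acid types


def shared_center_update(h, w_gate):
    """Aggregate residue features into one center and redistribute it.

    h:      (L, d) hidden representations, one row per residue.
    w_gate: (d, d) gate weights (hypothetical parameter).
    Assumption: mean pooling builds the center and a per-residue sigmoid
    gate decides how much of it each residue receives.
    """
    center = h.mean(axis=0)                      # (d,) global context
    gate = 1.0 / (1.0 + np.exp(-(h @ w_gate)))   # (L, d) values in (0, 1)
    return h + gate * center                     # center broadcast over residues


def alignment_loss(h, type_emb, labels):
    """Mean (1 - cosine similarity) between each residue and its type embedding.

    h:        (L, d) noisy hidden representations.
    type_emb: (NUM_TYPES, d) predefined semantic embeddings (assumed given).
    labels:   (L,) integer amino acid type per residue.
    Assumption: cosine distance stands in for the paper's alignment objective.
    """
    target = type_emb[labels]
    hn = h / (np.linalg.norm(h, axis=1, keepdims=True) + 1e-8)
    tn = target / (np.linalg.norm(target, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(hn * tn, axis=1)))


# Toy usage with random data (illustrative only, not real protein features).
rng = np.random.default_rng(0)
L, d = 8, 16
h = rng.normal(size=(L, d))
type_emb = rng.normal(size=(NUM_TYPES, d))
labels = rng.integers(0, NUM_TYPES, size=L)

h_ctx = shared_center_update(h, rng.normal(size=(d, d)) * 0.1)
loss = alignment_loss(h_ctx, type_emb, labels)
```

In a training loop, a term like `loss` would be added to the denoising objective, which is one way to read the "combined loss" mentioned in the TL;DR; the weighting between the two terms is not given in this section.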
