Diffusion Model with Representation Alignment for Protein Inverse Folding

Chenglin Wang, Yucheng Zhou, Zijie Zhai, Jianbing Shen, Kai Zhang

TL;DR

Protein inverse folding aims to recover amino acid sequences from a backbone structure. We introduce DMRA, a diffusion-based approach with a Shared Center that aggregates global context and a Representation Alignment mechanism that grounds denoising in predefined amino acid semantics, enabling residue-level alignment to the correct types. The model combines a structured protein graph, a diffusion forward process over 20 amino acid types, and a denoising network with message passing, a global center, and semantic alignment, trained with a combined loss. On CATH4.2 and generalization benchmarks TS50/TS500, DMRA achieves state-of-the-art recovery and perplexity while remaining competitive with external knowledge methods, and ablations confirm the critical roles of the Shared Center and Representation Alignment in driving performance and interpretability.
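
The forward process over 20 amino acid types mentioned above can be illustrated with a small sketch. The summary does not state the transition kernel or noise schedule, so this example assumes a uniform-transition kernel and a linear schedule; `cumulative_keep_prob` and `forward_noise` are hypothetical names chosen for illustration, not the authors' implementation.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types


def cumulative_keep_prob(t, num_steps):
    """Probability that a residue still carries its original type at step t.

    Assumption: a linear schedule, 1.0 at t=0 and 0.0 at t=num_steps.
    """
    return 1.0 - t / num_steps


def forward_noise(sequence, t, num_steps, rng):
    """Sample a noisy sequence x_t directly from the clean sequence x_0.

    Assumption: each residue is kept with probability `keep`; otherwise its
    type is resampled uniformly over all 20 types (which may, by chance,
    return the original type).
    """
    keep = cumulative_keep_prob(t, num_steps)
    noisy = []
    for residue in sequence:
        if rng.random() < keep:
            noisy.append(residue)
        else:
            noisy.append(rng.choice(AMINO_ACIDS))
    return "".join(noisy)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "MKTAYIAKQR"
    for step in (0, 50, 100):
        print(step, forward_noise(clean, step, 100, rng))
```

At `t = 0` the sequence is returned unchanged, and at `t = num_steps` every position is drawn from the uniform distribution, which is the prior a denoising network of this kind would start from at sampling time.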

Abstract

Protein inverse folding is a fundamental problem in bioinformatics, aiming to recover the amino acid sequences from a given protein backbone structure. Despite the success of existing methods, they struggle to fully capture the intricate inter-residue relationships critical for accurate sequence prediction. We propose a novel method that leverages diffusion models with representation alignment (DMRA), which enhances diffusion-based inverse folding by (1) proposing a shared center that aggregates contextual information from the entire protein structure and selectively distributes it to each residue; and (2) aligning noisy hidden representations with clean semantic representations during the denoising process. This is achieved by predefined semantic representations for amino acid types and a representation alignment method that utilizes type embeddings as semantic feedback to normalize each residue. In experiments, we conduct extensive evaluations on the CATH4.2 dataset to demonstrate that DMRA outperforms leading methods, achieving state-of-the-art performance and exhibiting strong generalization capabilities on the TS50 and TS500 datasets.
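
The two components described in the abstract can be sketched in a few lines. This is only a schematic under stated assumptions: the paper's actual shared-center update (its Cell module) and alignment objective are not given in this summary, so mean pooling, the dot-product gate, and the cosine-based loss below are stand-ins, and `shared_center_step` and `alignment_loss` are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_center_step(residues):
    """One aggregate-then-distribute round over per-residue feature vectors.

    residues: non-empty list of equal-length lists of floats.
    Returns the center vector and the updated residue features.
    """
    n, dim = len(residues), len(residues[0])
    # Aggregate: the center is the mean of all residue features (assumed pooling).
    center = [sum(r[d] for r in residues) / n for d in range(dim)]
    updated = []
    for r in residues:
        # Selective distribution: a scalar gate from the residue-center dot
        # product (a parameter-free stand-in for a learned gate).
        gate = sigmoid(sum(a * b for a, b in zip(r, center)))
        updated.append([a + gate * c for a, c in zip(r, center)])
    return center, updated


def alignment_loss(hidden, type_embedding):
    """1 - cosine similarity between a noisy hidden vector and the clean
    embedding of the residue's true amino acid type (assumed objective).
    Both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(hidden, type_embedding))
    norm_h = math.sqrt(sum(a * a for a in hidden))
    norm_e = math.sqrt(sum(b * b for b in type_embedding))
    return 1.0 - dot / (norm_h * norm_e)


if __name__ == "__main__":
    center, updated = shared_center_step([[1.0, 0.0], [0.0, 1.0]])
    print(center, updated)
    print(alignment_loss(updated[0], [1.0, 0.0]))
```

The gate is what makes the distribution selective: residues whose features agree more strongly with the global context receive more of it. The alignment term is zero when a residue's hidden vector points in the same direction as its type embedding and grows as the two diverge.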

Paper Structure

This paper contains 24 sections, 19 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: The protein inverse folding problem defines the mapping from a 3D structure to an amino acid sequence. “Shared Center” refers to the shared information among all residues in the protein chain, which is used to maintain the communication of residues. Each residue on the chain consists of three atoms: ${\bm{C}}$, ${\bm{N}}$, and ${\bm{O}}$, with the central carbon atom known as ${\bm{C}}_\alpha$. The characters on the right represent different types of amino acids.
  • Figure 2: Overview illustration of our DMRA model. We take a certain amino acid type in the sequence as an example (highlighted in the figure) to illustrate the process of the model. In the diffusion process, the amino acid type is represented by characters, where green characters represent the correct amino acid type and underlined red characters represent the incorrect amino acid type after transfer.
  • Figure 3: Details of the Shared Center module in the DMRA model. The shared node is initialized to aggregate contextual information from the entire protein chain, and the $\mathbf{Cell}$ module is formulated as in Eq. (\ref{c2})-(\ref{c3}).
  • Figure 4: Details of the Representation Alignment module in the DMRA model. The true type is used to enforce the residue to align with the clean type embeddings during the denoising process.
  • Figure 5: Nonlinear feature analysis of the shared center module's effect on layer outputs. The x-axis shows pairs of adjacent layers, and the y-axis shows the logarithm of the KL divergence, which measures changes in node feature distributions. Each polyline depicts the nonlinear divergence in node representations for one of 50 randomly selected proteins from the CATH4.2 test set, with and without the shared center module. Arrows indicate the direction towards nodes with stronger nonlinear features.
  • ...and 5 more figures
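
The quantity plotted in Figure 5, the logarithm of the KL divergence between node feature distributions at adjacent layers, can be sketched as follows. The caption does not say how feature vectors are turned into distributions, so this example assumes a softmax over feature dimensions; `log_kl_between_layers` is a hypothetical helper, not the authors' analysis code.

```python
import math


def softmax(values):
    """Turn a feature vector into a probability distribution (assumed step)."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def log_kl_between_layers(features_a, features_b, eps=1e-12):
    """log KL(p || q), where p and q are softmax-normalized node features
    taken from two adjacent layers. Returns -inf when the KL is zero."""
    p = softmax(features_a)
    q = softmax(features_b)
    kl = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute nothing
            kl += pi * math.log(pi / max(qi, eps))
    return math.log(kl) if kl > 0.0 else float("-inf")


if __name__ == "__main__":
    layer_1 = [0.0, 0.0, 0.0]
    layer_2 = [0.0, 1.0, 2.0]
    print(log_kl_between_layers(layer_1, layer_2))
```

Under this reading, a larger value between two adjacent layers means the layer changed the node's feature distribution more.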