Constraint Decoupled Latent Diffusion for Protein Backmapping

Xu Han, Yuancheng Sun, Kai Chen, Yuxuan Ren, Kang Liu, Qiwei Ye

TL;DR

CODLAD introduces a two-stage latent diffusion framework that decouples structural constraint handling from generation to backmap coarse-grained (CG) protein structures to all-atom (AA) detail. It first encodes AA structures into discrete, constraint-informed latent representations with a dual-level SE(3)-equivariant GNN, then performs diffusion in this latent space conditioned on CG inputs. With this design, CODLAD achieves superior atomistic accuracy (RMSD), topological fidelity (GED), and conformational diversity (DIV), along with substantial inference speedups. Across datasets including PED, ATLAS, PDB, and DES, the method generalizes well to unseen trajectory systems, and ablation studies attribute the gains to constraint decoupling, discrete latent spaces, and latent-space diffusion. The work offers a scalable, generalizable pathway for accurate and diverse CG-to-all-atom reconstructions, with code and resources publicly available for broader application.

Abstract

Coarse-grained (CG) molecular dynamics simulations enable efficient exploration of protein conformational ensembles. However, reconstructing atomic details from CG structures (backmapping) remains a challenging problem. Current approaches face an inherent trade-off between maintaining atomistic accuracy and exploring diverse conformations, often necessitating complex constraint handling or extensive refinement steps. To address these challenges, we introduce a novel two-stage framework, named CODLAD (COnstraint Decoupled LAtent Diffusion). This framework first compresses atomic structures into discrete latent representations, explicitly embedding structural constraints, thereby decoupling constraint handling from generation. Subsequently, it performs efficient denoising diffusion in this latent space to produce structurally valid and diverse all-atom conformations. Comprehensive evaluations on diverse protein datasets demonstrate that CODLAD achieves state-of-the-art performance in atomistic accuracy, conformational diversity, and computational efficiency while exhibiting strong generalization across different protein systems. Code is available at https://github.com/xiaoxiaokuye/CODLAD.
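To make the idea of a discrete latent representation concrete, here is a minimal sketch of codebook quantization. It assumes nearest-neighbor lookup under squared Euclidean distance, which is a common choice for vector-quantized autoencoders; the abstract does not specify the lookup rule, and the names `quantize`, `latent`, and `codebook` are hypothetical rather than taken from the CODLAD code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent, codebook):
    """Map a continuous latent vector to its nearest codebook entry.

    Assumption: nearest-neighbor lookup, as in standard vector quantization.
    Returns the index of the chosen entry and the entry itself.
    """
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(latent, codebook[k]))
    return index, codebook[index]


# Toy example: a 2-D latent for one residue and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
index, code = quantize((0.9, 1.2), codebook)
```

In this toy setup each residue-level latent would be replaced by one of a finite set of vectors, so a downstream generative model only needs to produce latents that lie close to valid codes.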

Paper Structure

This paper contains 30 sections, 15 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of backmapping. All-atom simulations are limited by high computational cost and short timescales (ns to $\mu$s), while CG simulations can reach longer timescales (ms to s) at lower cost. Backmapping enables recovery of valid atomic details from CG structures and facilitates atomic-level tasks.
  • Figure 2: Overview of CODLAD's two-stage framework. (a) Compression stage: All-atom structures are encoded with atom-level and residue-level (C$_\alpha$) message passing, with cross-level information exchange (see Eq. (\ref{eq:hierarchical_mp})), to produce residue-level latent representations. The decoder predicts internal coordinates and deterministically maps them to Cartesian coordinates. Reconstruction employs both coordinate-based and structural constraint losses (see Eq. (\ref{eq:vaeloss})), such as bond-length, bond-angle, and torsion terms, along with a clash penalty to discourage atom overlap. (Purple denotes atom-level operations; green denotes latent modules.) (b) Latent stage: From the residue-level latent $\mathbf{h}_0$, a forward noising process $q(\mathbf{h}_t\!\mid\!\mathbf{h}_{t-1})$ gradually adds small Gaussian noise (visualized as green$\rightarrow$gray blocks; gray blocks indicate noisy latent states). A learned reverse process $p(\mathbf{h}_{t-1}\!\mid\!\mathbf{h}_t,\mathbf{X},\mathbf{A})$ denoises in latent space (conditioned on the CG graph, where $\mathbf{X}$ and $\mathbf{A}$ denote CG coordinates and residue types) to recover valid latent representations, which are decoded as in (a) to generate all-atom structures following $p(\mathbf{x}\mid\mathbf{X},\mathbf{A})$. Panel (a) corresponds to the hierarchical message passing in Eq. (\ref{eq:hierarchical_mp}) and the autoencoder loss in Eq. (\ref{eq:vaeloss}); panel (b) follows the conditional denoising objective in Eq. (\ref{eq:final_denoise_loss}) and the sampling procedure summarized in Algs. S3–S4.
  • Figure 3: Schematic representation of internal coordinates in a protein residue. For illustrative purposes, the conversion process is demonstrated using the $\text{C}_\beta$ atom (labeled as atom 5) as an example: (1) The bond length $d$ is computed as the distance between $\text{C}_\beta$ (atom 5) and $\text{C}$ (atom 4). (2) The bond angle $\theta$ is calculated as the angle formed by $\text{N}$ (atom 3), $\text{C}$ (atom 4), and $\text{C}_\beta$ (atom 5). (3) The dihedral angle $\tau$ is determined from the planes formed by $\text{C}_\alpha$ (atom 2), $\text{N}$ (atom 3), $\text{C}$ (atom 4), and $\text{C}_\beta$ (atom 5).
  • Figure 4: Illustration of the compression process. (a) Protein structures are compressed from all-atom representation to a low-dimensional graph with learned node features. (b) The continuous features are further discretized using a learned codebook for dimensionality reduction.
  • Figure 5: Visualization of PED00055 protein structure generation from the PED dataset. Our method (e) maintains accurate structural validity near flexible side chains (red circles), closely matching the ground truth (a). In contrast, GenZProt (b) and DiAMoNDBack (c) generate clashing side-chain atoms in these regions, while FlowBack (d) does not produce obvious clashes but yields erroneous backbone topology (red C$_\alpha$ atoms), which may result from directly performing diffusion in the full-atom space.
  • ...and 4 more figures
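
The internal-coordinate conversion described in the Figure 3 caption can be sketched as follows. This is an illustrative example only, not the authors' implementation: the function names are hypothetical, the atom numbering (2, 3, 4, 5) follows the caption, and the dihedral uses a standard `atan2` formula whose sign convention is an assumption here.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def bond_length(p4, p5):
    """Distance d between atom 4 and atom 5."""
    return norm(sub(p5, p4))


def bond_angle(p3, p4, p5):
    """Angle theta (radians) at atom 4, formed by atoms 3, 4, and 5."""
    u, v = sub(p3, p4), sub(p5, p4)
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding error


def dihedral(p2, p3, p4, p5):
    """Dihedral tau (radians) between planes (2, 3, 4) and (3, 4, 5)."""
    b1, b2, b3 = sub(p3, p2), sub(p4, p3), sub(p5, p4)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2))


def internal_coords(p2, p3, p4, p5):
    """Return (d, theta, tau) for atom 5 given its three reference atoms."""
    return bond_length(p4, p5), bond_angle(p3, p4, p5), dihedral(p2, p3, p4, p5)


# Toy geometry with right angles, chosen so the values are easy to check.
d, theta, tau = internal_coords((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                                (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

Because all three quantities depend only on relative atom positions, they are unchanged by rigid rotations and translations of the residue, which is one reason a decoder might predict them and then map deterministically to Cartesian coordinates, as the Figure 2 caption describes.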