An Iterative Framework for Generative Backmapping of Coarse Grained Proteins
Georgios Kementzidis, Erin Wong, John Nicholson, Ruichen Xu, Yuefan Deng
TL;DR
This work tackles the reconstruction of fine-grained (FG) protein structures from ultra-coarse-grained (UCG) representations by introducing an iterative backmapping framework built on conditional variational autoencoders and graph neural networks. It formalizes backmapping as a chain of conditional distributions and derives a k-step evidence lower bound (ELBO) that allows each resolution to be optimized separately, which makes a divide-and-conquer approach practical. A two-step scheme that pairs CGVAE (for Cα traces) with GenZProt (for FG reconstruction) improves substantially on a one-step baseline in RMSD, graph edit distance, steric clashes, and Ramachandran-consistent secondary structure for two proteins with different structural characteristics, eIF4E and PED00151. The method offers memory-efficient, modular training and scalable accuracy gains, highlighting its potential for generating physically plausible FG conformations from ultra-coarse representations in biomolecular simulations. The work also outlines future extensions to deeper multi-step schemes and to intrinsically disordered proteins (IDPs).
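The chain of conditional distributions and the separable k-step bound mentioned above can be sketched as follows. This is an illustrative reconstruction, not the paper's own derivation: the notation is assumed here, with x^{(0)} the FG structure, x^{(k)} the UCG structure, deterministic coarse-graining maps x^{(i)} = M_i(x^{(i-1)}), and one conditional VAE per step with latent z_i, encoder q_{\phi_i}, decoder p_{\theta_i}, and conditional prior p_{\psi_i}.

```latex
% Illustrative notation (assumed, not taken from the paper):
%   x^{(0)}: FG structure,  x^{(k)}: UCG structure,
%   x^{(i)} = M_i(x^{(i-1)}): deterministic coarse-graining maps.
\begin{align}
% Chain of conditionals; every intermediate x^{(i)} is fixed by x^{(0)}.
p\bigl(x^{(0)} \mid x^{(k)}\bigr)
  &= \prod_{i=1}^{k} p\bigl(x^{(i-1)} \mid x^{(i)}\bigr), \\
% Standard conditional-VAE bound applied to each factor separately.
\log p\bigl(x^{(0)} \mid x^{(k)}\bigr)
  &\ge \sum_{i=1}^{k} \Bigl(
      \mathbb{E}_{q_{\phi_i}(z_i \mid x^{(i-1)},\, x^{(i)})}
        \bigl[\log p_{\theta_i}\bigl(x^{(i-1)} \mid z_i, x^{(i)}\bigr)\bigr]
      \notag \\
  &\qquad\quad
      - \mathrm{KL}\bigl(q_{\phi_i}(z_i \mid x^{(i-1)}, x^{(i)})
        \,\big\|\, p_{\psi_i}(z_i \mid x^{(i)})\bigr)
    \Bigr).
\end{align}
```

Under these assumptions, each term in the sum depends only on the parameters of step i, so the k models can be trained independently on pairs of adjacent resolutions.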
Abstract
Data-driven techniques for backmapping from coarse-grained (CG) to fine-grained (FG) representations often struggle with accuracy, training stability, and physical realism, especially when applied to complex systems such as proteins. In this work, we introduce a novel iterative framework based on conditional Variational Autoencoders and graph-based neural networks, specifically designed to tackle the challenges associated with such large-scale biomolecules. Our method enables stepwise refinement from CG beads to full atomistic detail. We outline the theory of iterative generative backmapping and demonstrate via numerical experiments the advantages of multistep schemes by applying them to proteins of vastly different structures with very coarse representations. This multistep approach not only improves the accuracy of reconstructions but also makes the training process more computationally efficient for proteins with ultra-CG representations.
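The stepwise refinement described above can be illustrated with a toy two-step pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`coarse_grain`, `toy_step`, `iterative_backmap`) are hypothetical, beads are plain centroids of equally sized groups, and each "generative model" is replaced by a stand-in that places fine particles at their parent bead plus Gaussian noise. In the actual framework, trained models such as CGVAE and GenZProt would take the place of the stand-in samplers.

```python
import numpy as np


def coarse_grain(coords, assignment, n_beads):
    """Map fine coordinates (n, 3) to bead centroids (n_beads, 3).

    assignment[j] is the index of the bead that fine particle j belongs to.
    """
    beads = np.zeros((n_beads, 3))
    counts = np.bincount(assignment, minlength=n_beads)
    np.add.at(beads, assignment, coords)  # sum coordinates per bead
    return beads / counts[:, None]


def toy_step(assignment, scale, rng):
    """Build a stand-in sampler for one backmapping step (hypothetical).

    Each fine particle is drawn as its parent bead position plus noise;
    a trained conditional generative model would replace this.
    """
    def sample(coarse):
        noise = rng.normal(scale=scale, size=(len(assignment), 3))
        return coarse[assignment] + noise
    return sample


def iterative_backmap(ucg, steps):
    """Apply the per-step samplers from coarsest to finest resolution."""
    x = ucg
    for step in steps:
        x = step(x)
    return x


rng = np.random.default_rng(0)
n_atoms, n_res, n_ucg = 24, 8, 2
atom_to_res = np.repeat(np.arange(n_res), n_atoms // n_res)  # 3 atoms per residue bead
res_to_ucg = np.repeat(np.arange(n_ucg), n_res // n_ucg)     # 4 residue beads per UCG bead

# Build the three resolutions from a random "FG structure".
fg = rng.normal(size=(n_atoms, 3))
res = coarse_grain(fg, atom_to_res, n_res)
ucg = coarse_grain(res, res_to_ucg, n_ucg)

# Two-step backmapping: UCG -> residue level -> FG.
steps = [toy_step(res_to_ucg, 1.0, rng), toy_step(atom_to_res, 0.5, rng)]
fg_sample = iterative_backmap(ucg, steps)
```

Because each step only needs pairs of adjacent resolutions, the samplers in `steps` can be built and trained separately and then chained at inference time, which is the modularity the multistep approach relies on.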
