Scalable Normalizing Flows Enable Boltzmann Generators for Macromolecules

Joseph C. Kim, David Bloore, Karan Kapoor, Jun Feng, Ming-Hong Hao, Mengdi Wang

TL;DR

The paper addresses the challenge of scalable Boltzmann sampling for macromolecules by introducing a split-channel normalizing-flow architecture that operates in reduced internal coordinates and employs gated-attention coupling layers. A multi-stage training regimen blends maximum-likelihood and energy-based objectives, with a backbone-focused 2-Wasserstein loss on distance matrices to enforce global structural fidelity while preserving local details. Evaluations on HP35 and Protein G demonstrate improved backbone geometry, low-energy generated conformations, and the ability to discover novel metastable states not present in training, outperforming traditional NSF baselines. These advances enable more efficient and physically grounded sampling of protein conformations, with potential impact on drug design and understanding of functional states, while highlighting avenues for transferability and further methodology enhancements.

Abstract

The Boltzmann distribution of a protein provides a roadmap to all of its functional states. Normalizing flows are a promising tool for modeling this distribution, but current methods do not scale to typical pharmacological targets: they become computationally intractable owing to the size of the system, the heterogeneity of the intra-molecular potential energy, and long-range interactions. To remedy these issues, we present a novel flow architecture that utilizes split channels and gated attention to efficiently learn the conformational distribution of proteins defined by internal coordinates. We show that by utilizing a 2-Wasserstein loss, one can smooth the transition from maximum likelihood training to energy-based training, enabling the training of Boltzmann Generators for macromolecules. We evaluate our model and training strategy on villin headpiece HP35(nle-nle), a 35-residue subdomain, and protein G, a 56-residue protein. We demonstrate that standard architectures and training strategies, such as maximum likelihood alone, fail, while our novel architecture and multi-stage training strategy are able to model the conformational distributions of protein G and HP35.

Paper Structure (30 sections, 11 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: (a) Our split flow architecture. (b) Each transformation block consists of a gated attention rational quadratic spline (RQS) coupling layer. (c) Example structures of protein G from the flow $q_\theta$ (left) and from molecular dynamics simulation $p$ (right). We also show sample distance matrices $\mathbf{D}(\mathbf{x}_{q_\theta})$ and $\mathbf{D}(\mathbf{x}_p)$.
  • Figure 2: A two residue chain. Hydrogens on carbon atoms are omitted. Backbone atoms are highlighted green. Shown is an example of a bond length $d$, a bond angle $\theta$, and a dihedral/torsion angle $\phi$.
  • Figure 3: Sample conformations generated by BG via different training strategies. (a) Root mean square fluctuation (RMSF) computed for each residue (C$\alpha$ atoms) in HP35 and protein G. Matching the training dataset's plot is desirable. (b) Examples of HP35 from ground truth training data, generated samples from our model, and generated samples from the baseline model. (c) Example of two metastable states from protein G training data. (d) Low-energy conformations of protein G generated by our model superimposed on each other. We also show some examples of pathological structures generated after training with different training paradigms: NLL (maximum likelihood), both NLL and KL divergence, and NLL and the 2-Wasserstein loss. Atom clashes are highlighted with red circles.
  • Figure 4: BGs can generate novel sample conformations. (a) Protein G 2D UMAP embeddings for the training data, test data, and $2 \times 10^5$ generated samples. (b) A representative structure generated by the BG model that was not found in the training data (cyan), shown with the closest structure in the training dataset by RMSD (magenta). Both structures are depicted as stars with their respective structural colors in (a). (c) Protein G energy distribution of the training dataset (orange) and of samples generated by our model (blue). The second energy peak of the sampled conformations covers the novel structure shown in (b). (d) An overlay of high-resolution, lowest-energy all-atom structures of protein G generated by the BG model. This demonstrates that our model is capable of sampling low-energy conformations at atomic resolution.
  • Figure S.1: An illustration of how four atoms define a bond length, a bond angle, and a dihedral angle. Subscripts indicate the atoms that define each value, with the order given by the bond graph connectivity. In an internal coordinate system, the Cartesian position of atom 4 is determined by the positions of atoms 1, 2, and 3 together with the bond length, bond angle, and dihedral angle.
  • ...and 2 more figures
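
The internal-coordinate description in the captions of Figure 2 and Figure S.1 can be illustrated with a minimal sketch. This is not the authors' code: the function names (`bond_length`, `bond_angle`, `dihedral`, `place_atom`) are hypothetical, the dihedral sign convention (IUPAC) is an assumption, and the placement step uses the commonly used NeRF-style construction rather than anything confirmed by the paper. It computes $d$, $\theta$, and $\phi$ from four atom positions and then rebuilds atom 4 from atoms 1 to 3 plus those three values.

```python
import math

# Small 3-vector helpers on plain tuples (kept dependency-free on purpose).
def sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def add(a, b): return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
def scale(a, s): return (a[0] * s, a[1] * s, a[2] * s)
def dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
def norm(a): return math.sqrt(dot(a, a))
def unit(a): return scale(a, 1.0 / norm(a))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def bond_length(p3, p4):
    """Distance d between bonded atoms 3 and 4."""
    return norm(sub(p4, p3))

def bond_angle(p2, p3, p4):
    """Angle theta at atom 3 between atoms 2 and 4, in radians."""
    c = dot(unit(sub(p2, p3)), unit(sub(p4, p3)))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding

def dihedral(p1, p2, p3, p4):
    """Signed torsion phi about the 2-3 bond, in radians (assumed IUPAC sign)."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2))

def place_atom(p1, p2, p3, d, theta, phi):
    """Cartesian position of atom 4 from atoms 1-3 and (d, theta, phi)."""
    bc = unit(sub(p3, p2))               # axis along the 2-3 bond
    n = unit(cross(sub(p2, p1), bc))     # normal to the 1-2-3 plane
    m = cross(n, bc)                     # completes the local frame
    local = (-d * math.cos(theta),
             d * math.sin(theta) * math.cos(phi),
             d * math.sin(theta) * math.sin(phi))
    offset = add(add(scale(bc, local[0]), scale(m, local[1])),
                 scale(n, local[2]))
    return add(p3, offset)

# Round trip on arbitrary illustrative coordinates (not from the paper).
atoms = [(0.3, -1.2, 0.5), (1.1, 0.2, -0.4), (2.0, 0.9, 0.7), (2.6, 2.1, 0.1)]
d = bond_length(atoms[2], atoms[3])
theta = bond_angle(atoms[1], atoms[2], atoms[3])
phi = dihedral(*atoms)
rebuilt = place_atom(atoms[0], atoms[1], atoms[2], d, theta, phi)
```

In this sketch `rebuilt` should coincide with the original position of atom 4 up to floating-point error, which is the sense in which three internal coordinates per atom carry the same information as its Cartesian position once the three preceding atoms are fixed.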

Theorems & Definitions (1)

  • Definition 3.1: Distance Distortion