Bayesian learning for accurate and robust biomolecular force fields

Vojtech Kostal, Brennon L. Shanks, Pavel Jungwirth, Hector Martinez-Seara

TL;DR

The paper addresses the problem of parameterizing biomolecular force fields with quantified uncertainty by introducing a Bayesian framework that learns partial charges from ab initio MD data in explicit solvent. A computationally efficient Local Gaussian Process surrogate makes likelihood evaluation affordable during Bayesian inference, which yields robust, transferable parameter estimates across diverse molecular fragments. The authors demonstrate improved agreement with high-level references and experimental observables, achieving sub-percent accuracy for densities and reasonable accuracy for solvation and binding properties, with posterior distributions providing principled uncertainty quantification. They further validate transferability by applying fragment-derived charges to calcium binding in cardiac troponin C, where computed binding free energies align closely with experiment, highlighting the method's potential to bridge electronic-structure accuracy and classical-scale simulations. The approach is positioned as a general, open framework for uncertainty-aware force-field development that can integrate diverse data sources and scale with advances in GPU-accelerated inference, although challenges remain for very high-dimensional parameter spaces and for obtaining representative training data.

Abstract

Molecular dynamics is a valuable tool to probe biological processes at the atomistic level - a resolution often elusive to experiments. However, the credibility of molecular models is limited by the accuracy of the underlying force field, which is often parametrized relying on ad hoc assumptions. To address this gap, we present a Bayesian framework for learning physically grounded parameters directly from ab initio molecular dynamics data. By representing both model parameters and data probabilistically, the framework yields interpretable, statistically rigorous models in which uncertainty and transferability emerge naturally from the learning process. This approach provides a transparent, data-driven foundation for developing predictive molecular models and enhances confidence in computational descriptions of biophysical systems. We demonstrate the method using 18 biologically relevant molecular fragments that capture key motifs in proteins, nucleic acids, and lipids, and, as a proof of concept, apply it to calcium binding to troponin - a central event in cardiac regulation.

Paper Structure

This paper contains 9 sections, 12 equations, 5 figures.

Figures (5)

  • Figure 1: Overview of the Bayesian inference workflow. The workflow begins with data acquisition (a): a set of $N$ randomized partial charge vectors $\boldsymbol{\theta}$, collected in matrix $\mathbf{X}$, is used to run force-field molecular dynamics (FFMD) simulations, and the corresponding quantity-of-interest (QoI) outputs are stored in matrix $\mathbf{Y}$. Ab initio molecular dynamics (AIMD) simulations provide reference trajectories, from which QoIs are extracted into vector $\boldsymbol{y}$. The training output matrix $\mathbf{Y}$ is partitioned column-wise into sub-matrices $\mathbf{Y}^{(k)}$, each corresponding to the $k$-th QoI. In surrogate modeling (b), a separate Local Gaussian Process (LGP) is trained for each QoI. Kernel hyperparameters are optimized on the FFMD dataset $\{\mathbf{X}, \mathbf{Y}^{(k)}\}$ using the leave-one-out marginal likelihood. The resulting inverse kernel matrix is precomputed to construct the final LGP surrogate. In parameter optimization (c), Bayes' theorem combines a prior probability with the likelihood of observing the AIMD reference features $\boldsymbol{y}$ given the FFMD QoIs predicted by the LGPs. Posterior sampling is performed using Markov Chain Monte Carlo (MCMC). In each iteration, new parameters $\boldsymbol{\theta}^*$ and QoI-specific nuisance parameters $\boldsymbol{n}^*$ are proposed, FFMD QoIs $\boldsymbol{y}(\boldsymbol{\theta}^*)$ are predicted from the set of surrogates, and the likelihood is evaluated. Multiplying the likelihood by the prior yields the posterior value up to a normalization constant. The MCMC loop continues until the posterior distribution converges, yielding the optimized force field parameter distribution.
  • Figure 2: Accuracy and validation of optimized partial charges. a: Licorice representations of the parameterized species grouped according to their net charge. b: Boxplots showing the normalized mean absolute error (NMAE) of ten samples from the partial charge posterior distribution against AIMD references for three QoIs, color-coded by molecular charge: neutral (blue), anions (pink), and cations (orange). Subpanels indicate the relative average improvement in percent with respect to CHARMM36-nbfix: green for improvement, red for regression. c: Relative deviations ($\Delta$) between simulated and experimental densities for selected species at different concentrations at 298 K. The green band indicates the $\pm1\%$ error margin.
  • Figure 3: Chemical insights from the Bayesian inference of partial charge distributions. a-c: Posterior mean partial charges (bullets) with 95% confidence intervals (error bars) for all atom types across the set of species parameterized in this work, color-coded by chemical element. The species are grouped by their net charge into individual panels as (a) neutrals, (b) anions, and (c) cations. d: Graph representation of the atomic partial charges grouped by chemical similarity. Nodes of the graph represent the individual atom types, while edges represent covalent bonds between atom types that are physically connected.
  • Figure 4: Calcium binding to the regulatory domain of human cardiac troponin-C (N-cTnC). a-c: Cartoon representations of the relevant parts of the N-cTnC. a: Overall structure of N-cTnC (transparent blue) with bound Ca$^{2+}$ (green). b: Close-up of the EF-hand loop (orange) highlighting Ca$^{2+}$-coordinating residues (licorice representation). c: Zoom-in of the characteristic carboxylate motif involved in Ca$^{2+}$ binding. d: Posterior distributions of the optimized partial charges of acetate shown on the diagonal (purple), with pairwise parameter correlations on the off-diagonals (density contours). Individual posterior samples are indicated as gray background dots. e: Protein parameterizations generated by sampling carboxylate charges from the acetate posterior distribution. f: Schematic depiction of the unbound (left) and Ca$^{2+}$-bound (right) states of N-cTnC used to compute the binding free energy. g: Computed Ca$^{2+}$ binding free energies for N-cTnC as a function of the sampled carboxyl oxygen charge, shown with its marginal posterior (purple). Points are colored by the dipole moment of the CH$_2$COO$^-$ fragment. Bullets show results from the force fields developed in this work, while triangles represent values from CHARMM36, CHARMM36-nbfix, and ProsECCo75. The experimental $\Delta G_\text{bind}$ ($-28.6$ kJ/mol) is indicated by the dashed black line.
  • Figure 5: The thermodynamic cycle used to calculate the standard binding free energy of Ca$^{2+}$ to the EF-hand loop of the regulatory domain of troponin. The protein is shown in blue, Ca$^{2+}$ in orange, and restraints in pink. A filled orange circle indicates Ca$^{2+}$ fully interacting (coupled) with its environment, while an open circle denotes the decoupled state. The pink square represents the imposed restraints.
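
The three stages in the Figure 1 caption (data acquisition, surrogate modeling, posterior sampling) can be illustrated with a minimal toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `toy_ffmd` is a hypothetical analytic stand-in for an FFMD simulation, the noisy `y_ref` stands in for the AIMD reference, `GPSurrogate` is a plain Gaussian process with fixed hyperparameters (not an LGP optimized by leave-one-out marginal likelihood), the nuisance parameters are taken to be one Gaussian noise width per QoI, the priors are uniform, and the sampler is a basic random-walk Metropolis scheme.

```python
import numpy as np

GRID = np.linspace(0.5, 2.0, 8)  # hypothetical grid on which each QoI curve is evaluated

def toy_ffmd(theta):
    """Hypothetical stand-in for an FFMD run: maps 2 charges to two 8-point QoI curves."""
    q1, q2 = theta
    return np.concatenate([np.exp(-q1 * GRID) + q2 * GRID, np.sin(q1 + q2 * GRID)])

def rbf_kernel(A, B, length_scale, variance):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

class GPSurrogate:
    """Plain GP regressor for one QoI (assumption: fixed hyperparameters, no locality)."""
    def __init__(self, X, Y, length_scale=0.7, variance=1.0, jitter=1e-6):
        self.X, self.ls, self.var = X, length_scale, variance
        self.mean = Y.mean(axis=0)
        K = rbf_kernel(X, X, length_scale, variance) + jitter * np.eye(len(X))
        # Solve against the kernel matrix once so each prediction is a single product.
        self.alpha = np.linalg.solve(K, Y - self.mean)

    def predict(self, theta):
        k_star = rbf_kernel(theta[None, :], self.X, self.ls, self.var)
        return self.mean + (k_star @ self.alpha)[0]

def log_posterior(x, surrogates, y_refs, theta_box=(-1.0, 1.0), log_sigma_box=(-5.0, 0.0)):
    """Unnormalized log posterior over [theta, log_sigma_1, ..., log_sigma_K]."""
    n_theta = len(x) - len(surrogates)
    theta, log_sigma = x[:n_theta], x[n_theta:]
    # Uniform priors: zero probability outside the boxes.
    if np.any(theta < theta_box[0]) or np.any(theta > theta_box[1]):
        return -np.inf
    if np.any(log_sigma < log_sigma_box[0]) or np.any(log_sigma > log_sigma_box[1]):
        return -np.inf
    logp = 0.0
    for gp, y_ref, ls in zip(surrogates, y_refs, log_sigma):
        resid = y_ref - gp.predict(theta)
        # Gaussian log likelihood with a QoI-specific noise width sigma = exp(ls).
        logp += -0.5 * np.sum(resid**2) / np.exp(2.0 * ls) - resid.size * ls
    return logp

def metropolis(log_post, x0, n_steps, step, rng):
    """Random-walk Metropolis sampler with Gaussian proposals."""
    x, lp = np.array(x0, dtype=float), log_post(x0)
    chain = np.empty((n_steps, x.size))
    for i in range(n_steps):
        proposal = x + step * rng.standard_normal(x.size)
        lp_new = log_post(proposal)
        if np.log(rng.random()) < lp_new - lp:
            x, lp = proposal, lp_new
        chain[i] = x
    return chain

rng = np.random.default_rng(0)
# (a) Data acquisition: randomized parameter vectors X and their QoI outputs Y.
X = rng.uniform(-1.0, 1.0, size=(60, 2))
Y = np.array([toy_ffmd(t) for t in X])
theta_true = np.array([0.3, -0.4])
y_ref = toy_ffmd(theta_true) + 0.05 * rng.standard_normal(Y.shape[1])

# (b) Surrogate modeling: one GP per QoI, trained on the column block Y^(k).
surrogates = [GPSurrogate(X, Yk) for Yk in np.split(Y, 2, axis=1)]
y_refs = np.split(y_ref, 2)

# (c) Posterior sampling, started from the training point closest to the reference.
x0 = np.concatenate([X[np.argmin(((Y - y_ref) ** 2).sum(axis=1))], [-1.0, -1.0]])
step = np.array([0.01, 0.01, 0.2, 0.2])
chain = metropolis(lambda x: log_posterior(x, surrogates, y_refs), x0, 10000, step, rng)
posterior_mean = chain[2000:, :2].mean(axis=0)  # discard burn-in, keep the charges
print("posterior mean of theta:", posterior_mean)
```

Because every likelihood evaluation calls only the cheap surrogates, the sampler never needs to launch a new simulation. In this toy setup the spread of `chain[2000:, :2]` plays the role of the parameter uncertainty that the caption attributes to the posterior distribution.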