Learning Collective Variables with Synthetic Data Augmentation through Physics-Inspired Geodesic Interpolation

Soojung Yang; Juno Nam; Johannes C. B. Dietschreit; Rafael Gómez-Bombarelli

TL;DR

This work tackles the problem of learning expressive collective variables (CVs) for accelerating rare-event transitions in molecular dynamics by addressing the scarcity of true transition-state data. It introduces a simulation-free data-augmentation scheme based on geodesic interpolation on a Riemannian manifold of protein conformations, generating pseudo-transition-state configurations and a transition-progress parameter $t$ to supervise CV learning. The authors compare discriminant-analysis–based CVs with a novel regression-based approach (Reggeo) that uses $t$ as the target, showing improved accuracy in free-energy differences $\Delta F$ and PMFs when synthetic transition data are filtered appropriately. The approach reduces reliance on costly transition-path sampling, enables one-shot CV training, and provides a generalizable framework for other rare-event problems, albeit with caveats about the physical fidelity of pseudo-TSE data and out-of-distribution robustness.

Abstract

In molecular dynamics simulations, rare events, such as protein folding, are typically studied using enhanced sampling techniques, most of which are based on the definition of a collective variable (CV) along which acceleration occurs. Obtaining an expressive CV is crucial, but often hindered by the lack of information about the particular event, e.g., the transition from the unfolded to the folded conformation. We propose a simulation-free data augmentation strategy that uses physics-inspired metrics to generate geodesic interpolations resembling protein folding transitions, thereby improving sampling efficiency without requiring true transition-state samples. This new data can be used to improve the accuracy of classifier-based methods. Alternatively, a regression-based learning scheme for CV models can be adopted by leveraging the interpolation progress parameter.
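The regression idea in the last sentence can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the feature vectors are random placeholders rather than protein descriptors, a straight-line blend stands in for the geodesic interpolation, and a linear least-squares model stands in for whatever CV model the paper trains. Only the use of the progress parameter $t$ as the regression target is taken from the text.

```python
import numpy as np

def interpolate_features(x_unfolded, x_folded, t):
    """Linear stand-in for the geodesic: blend two feature vectors by progress t."""
    t = np.asarray(t)[:, None]
    return (1.0 - t) * x_unfolded + t * x_folded

def fit_linear_cv(features, targets):
    """Fit s(x) = w.x + b to the progress labels by ordinary least squares."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef[:-1], coef[-1]

rng = np.random.default_rng(0)
n_features = 8  # hypothetical descriptor size

# Placeholder end states; in practice these would come from metastable-state data.
x_unfolded = rng.normal(size=n_features)
x_folded = rng.normal(size=n_features)

# Synthetic "transition" samples labeled with their interpolation progress t.
t_train = rng.uniform(0.0, 1.0, size=500)
features = interpolate_features(x_unfolded, x_folded, t_train)
features += 0.01 * rng.normal(size=features.shape)  # small jitter around the path

w, b = fit_linear_cv(features, t_train)

# The learned CV should map the end states close to 0 and 1.
cv_unfolded = float(x_unfolded @ w + b)
cv_folded = float(x_folded @ w + b)
print(cv_unfolded, cv_folded)
```

In this toy setting the fitted CV simply recovers progress along the straight line between the two end states; a nonlinear model and a curved interpolation path would be needed for the behavior described in the paper.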

Paper Structure (14 sections, 6 equations, 6 figures, 1 table)

Introduction
Methods
Geodesic interpolation of protein conformations
Leveraging the interpolation parameter as an indicator of reaction progress
Chignolin as a model system
Transition-focused data augmentation
Machine learning collective variable models
Enhanced sampling and result analysis
Results and Discussion
Geodesic interpolation resembles state-state transitions
Data augmentation improves ML CV
Analysis of CV model predictions for sampled configurations
Current limitations of ML CVs
Conclusion

Figures (6)

Figure 1: Depiction of a common pipeline for data-driven CVs. Initially, only data from metastable states are available (left). Long and costly production runs are meaningful only when performed with a reliable CV (right). Top (previous methods): A trial CV is trained and then iteratively improved through enhanced sampling simulations, which generate more data of the transition between the metastable states. Bottom (ours): Geodesic interpolations are used to create synthetic TSE data, from which the CV can be trained in one shot, obviating the need for an iterative procedure.
Figure 2: Comparison of the unfolded--folded transition observed in a long reference trajectory and the corresponding transition generated through geodesic interpolation of the end points. (a) Overlay of interpolated structures (opaque) and the reference structures (transparent). The initial and final structures are identical. (b) Evolution of the donor--acceptor distances for the two key hydrogen bonds observed in the folded state, in respective colors. The dashed lines are the reference transition as a function of time, and the solid lines are the interpolated conformations as a function of parameter $t$. (c) Overlay of 5,000 interpolated samples with $t$ uniformly sampled over the range of [0, 1], displayed on the first two slow modes from TICA. The projection of the reference unbiased trajectory is shown in gray in the background. Low TIC 1 values (left side) correspond to the unfolded basin, while high values (right side) correspond to the folded basin. (d) Interpolation parameter $t$ correlates with the progress along the slowest mode (TIC 1), which describes the transition.
Figure 3: (a) Scheme for geodesic interpolation between unfolded and folded conformations. The interpolated structure $\mathbf{Z}$ corresponding to interpolation parameter $t$ is obtained using the geodesic in eq \ref{eq:geodesic}. (b) Procedure to estimate interpolation parameter $\hat{t}$ of a given intermediate conformation and sets of unfolded and folded conformations, which is given by a ratio of the minimum distances to unfolded and folded conformations (eq \ref{eq:reverse_cal_t}). (c) Parity plot comparing the true interpolation parameter $t$ with the reverse-calculated $\hat{t}$ for 5,000 interpolations, with $t$ uniformly sampled over the range [0, 1].
Figure 4: Convergence of the free energy difference ($\Delta F$) between the folded and unfolded states and the PMF at the end of the simulation as sampled with WTM-eABF using each CV. Top panel: Evolution of the $\Delta F$ estimate over the course of the trajectory. Horizontal solid lines indicate a $\pm$ 4 kJ/mol range (chemical accuracy) around the $\Delta F$ value obtained from the long unbiased reference runs (dashed line). Bottom panel: Comparison of reference PMF obtained by projection of reference data (dotted line) and those obtained from 1 $\mu$s WTM-eABF simulations with the given CV. The shaded areas represent the standard deviation of the independent simulations, and the solid line is the mean.
Figure 5: Projections of the conformations from the unbiased reference trajectory (upper rows) and WTM-eABF enhanced sampling trajectory (lower rows) using different CV models onto the first two time-lagged independent components, colored based on the normalized CV value. The CV values were scaled such that the PMF minima of the unfolded and folded basins correspond to 0 and 1, respectively.
...and 1 more figure
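The reverse calculation described in the Figure 3(b) caption can be sketched as follows. The caption only states that $\hat{t}$ is a ratio of the minimum distances to the unfolded and folded sets; the specific form $\hat{t} = d_U / (d_U + d_F)$, the Euclidean distance, and the function name `estimate_progress` are assumptions made for this illustration, not the paper's equation or metric.

```python
import numpy as np

def estimate_progress(z, unfolded_set, folded_set):
    """Estimate progress of conformation z from its nearest unfolded/folded neighbors.

    Assumed form: t_hat = d_U / (d_U + d_F), with Euclidean distances in a
    generic feature space (the paper uses its own physics-inspired metric).
    """
    d_unfolded = np.min(np.linalg.norm(unfolded_set - z, axis=1))
    d_folded = np.min(np.linalg.norm(folded_set - z, axis=1))
    total = d_unfolded + d_folded
    if total == 0.0:
        raise ValueError("z coincides with both an unfolded and a folded sample")
    return d_unfolded / total

# Toy 2D "conformations": two unfolded and two folded reference points.
unfolded_set = np.array([[0.0, 0.0], [0.0, 1.0]])
folded_set = np.array([[4.0, 0.0], [4.0, 1.0]])

z = np.array([1.0, 0.0])  # 1 unit from the unfolded set, 3 units from the folded set
t_hat = estimate_progress(z, unfolded_set, folded_set)
print(t_hat)  # 0.25
```

With this form, a conformation that coincides with an unfolded sample gives $\hat{t} = 0$ and one that coincides with a folded sample gives $\hat{t} = 1$, which is the behavior a parity plot such as Figure 3(c) would check against the true $t$.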
