Learning Collective Variables with Synthetic Data Augmentation through Physics-Inspired Geodesic Interpolation
Soojung Yang, Juno Nam, Johannes C. B. Dietschreit, Rafael Gómez-Bombarelli
TL;DR
This work addresses the problem of learning expressive collective variables (CVs) for enhanced sampling of rare events in molecular dynamics when true transition-state data are scarce. It introduces a simulation-free data-augmentation scheme based on geodesic interpolation on a Riemannian manifold of protein conformations, which generates pseudo-transition-state configurations together with a transition-progress parameter $t$ that can supervise CV learning. The authors compare discriminant-analysis-based CVs with a novel regression-based approach (Reggeo) that uses $t$ as the regression target, and report improved accuracy in free-energy differences $\Delta F$ and potentials of mean force (PMFs) when the synthetic transition data are filtered appropriately. The approach reduces reliance on costly transition-path sampling, enables one-shot CV training, and provides a framework that could generalize to other rare-event problems, with caveats about the physical fidelity of the pseudo-transition-state ensemble (pseudo-TSE) data and about out-of-distribution robustness.
Abstract
In molecular dynamics simulations, rare events, such as protein folding, are typically studied using enhanced sampling techniques, most of which are based on the definition of a collective variable (CV) along which acceleration occurs. Obtaining an expressive CV is crucial, but is often hindered by the lack of information about the particular event, e.g., the transition from the unfolded to the folded conformation. We propose a simulation-free data augmentation strategy that uses physics-inspired metrics to generate geodesic interpolations resembling protein folding transitions, thereby improving sampling efficiency without true transition state samples. This new data can be used to improve the accuracy of classifier-based methods. Alternatively, a regression-based learning scheme for CV models can be adopted by leveraging the interpolation progress parameter.
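The regression idea in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the function names are hypothetical, the endpoint structures are random coordinates rather than protein conformations, a straight line in the transformed distance space $\exp(-\alpha d)$ stands in for a true geodesic under a physics-inspired metric, and a linear least-squares model stands in for a learned CV model.

```python
import numpy as np

def pairwise_distances(x):
    """Return the upper-triangle interatomic distances of an (n_atoms, 3) array."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(x), k=1)
    return d[iu]

def interpolate_features(x_a, x_b, n_points=21, alpha=1.0):
    """Stand-in for geodesic interpolation (assumption): a straight line
    between the two endpoints in exp(-alpha * distance) feature space."""
    q_a = np.exp(-alpha * pairwise_distances(x_a))
    q_b = np.exp(-alpha * pairwise_distances(x_b))
    t = np.linspace(0.0, 1.0, n_points)  # interpolation progress parameter
    feats = (1.0 - t)[:, None] * q_a + t[:, None] * q_b
    return feats, t

def fit_linear_cv(feats, t):
    """Regress the progress parameter t on the features (with a bias term)."""
    design = np.hstack([feats, np.ones((len(feats), 1))])
    w, *_ = np.linalg.lstsq(design, t, rcond=None)
    return w

def evaluate_cv(w, feats):
    """Evaluate the fitted linear CV on a batch of feature vectors."""
    design = np.hstack([feats, np.ones((len(feats), 1))])
    return design @ w

# Hypothetical endpoint structures: an extended and a compact 6-atom cloud.
rng = np.random.default_rng(0)
x_unfolded = rng.normal(scale=3.0, size=(6, 3))
x_folded = rng.normal(scale=1.0, size=(6, 3))

feats, t = interpolate_features(x_unfolded, x_folded)
w = fit_linear_cv(feats, t)
cv = evaluate_cv(w, feats)
```

Because the toy path is exactly linear in the chosen features, the fitted CV reproduces $t$ along it. With many endpoint pairs, a curved interpolation, and a nonlinear model, the regression would no longer be exact.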
