Forces are not Enough: Benchmark and Critical Evaluation for Machine Learning Force Fields with Molecular Simulations

Xiang Fu; Zhenghao Wu; Wujie Wang; Tian Xie; Sinan Keten; Rafael Gomez-Bombarelli; Tommi Jaakkola

TL;DR

This work addresses the gap between force/energy prediction accuracy and the real-world utility of ML force fields in long-timescale MD simulations. It introduces a diverse benchmark suite with physically meaningful observables (RDF/h(r), diffusivity, FES) and a stability criterion, enabling evaluation beyond force errors. Through comprehensive experiments on MD17, water, alanine dipeptide, and LiPS, the study reveals that stability, data coverage, and energy-conservation bias critically shape MD performance, with NequIP often achieving the best balance when stable. The open-source framework and datasets aim to steer future research toward robust, simulation-ready ML force fields that can reliably reproduce macroscopic observables while remaining computationally efficient.

Abstract

Molecular dynamics (MD) simulation techniques are widely used across the natural sciences. Increasingly, machine learning (ML) force field (FF) models are replacing ab-initio simulations by predicting forces directly from atomic structures. Despite significant progress in this area, such techniques are primarily benchmarked by their force/energy prediction errors, even though the practical use case is to produce realistic MD trajectories. We aim to fill this gap by introducing a novel benchmark suite for learned MD simulation. We curate representative MD systems, including water, organic molecules, a peptide, and materials, and design evaluation metrics corresponding to the scientific objectives of the respective systems. We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics. We demonstrate when and how selected SOTA methods fail, and offer directions for further improvement. Specifically, we identify stability as a key metric for ML models to improve. Our benchmark suite comes with a comprehensive open-source codebase for training and simulation with ML FFs to facilitate future work.
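The abstract contrasts force errors with simulation-based metrics such as the radial distribution function (RDF). As a minimal sketch of one such metric, the Python below computes an RDF from a trajectory and compares two RDFs. It assumes a cubic periodic box, a single atom type, and a plain mean absolute difference over histogram bins; the paper's exact definition, binning, and normalization may differ. The function names `radial_distribution` and `rdf_mae` are hypothetical and are not taken from the paper's codebase.

```python
import numpy as np


def radial_distribution(positions, box_length, r_max, n_bins):
    """Frame-averaged RDF g(r) for a single-species system in a cubic periodic box.

    positions: array of shape (n_frames, n_atoms, 3).
    r_max should not exceed box_length / 2 (minimum-image convention).
    """
    n_frames, n_atoms, _ = positions.shape
    edges = np.linspace(0.0, r_max, n_bins + 1)
    counts = np.zeros(n_bins)
    upper = np.triu_indices(n_atoms, k=1)  # each unordered pair once

    for frame in positions:
        diff = frame[:, None, :] - frame[None, :, :]
        diff -= box_length * np.round(diff / box_length)  # minimum image
        dist = np.linalg.norm(diff, axis=-1)[upper]
        counts += np.histogram(dist, bins=edges)[0]

    # Expected pair count per frame in each shell for an ideal gas.
    shell_volume = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    n_pairs = n_atoms * (n_atoms - 1) / 2.0
    ideal = n_pairs * shell_volume / box_length ** 3

    centers = 0.5 * (edges[1:] + edges[:-1])
    return centers, counts / (n_frames * ideal)


def rdf_mae(g_reference, g_model):
    """Mean absolute difference between two RDFs on the same bins (an assumed metric)."""
    return float(np.mean(np.abs(np.asarray(g_reference) - np.asarray(g_model))))
```

For uniformly random positions the computed g(r) should be close to 1 at large r, which is a quick sanity check before comparing a reference trajectory against a model trajectory.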

Paper Structure (10 sections, 5 equations, 18 figures, 12 tables)

This paper contains 10 sections, 5 equations, 18 figures, 12 tables.

Introduction
Preliminaries
Related Work
Datasets
Evaluation Metrics
Experiments
Failure Modes: Causes and Future Directions
Conclusion and Outlook
Dataset details
Experimental details

Figures (18)

Figure 1: (a) Results on simulating a water system with ML force fields. Models are sorted by force mean absolute error (MAE) in descending order. High stability and low radial distribution function (RDF) MAE are better. Performance in force error does not align with simulation-based metrics. (b) Force-only evaluation may not reveal key factors in simulating MD with an ML force field. In this toy example, model 2 (green) has a lower force error but likely leads to unstable simulations due to extreme forces from local pathological behavior. (c) Illustrations of MD observables.
Figure 2: Visualization of the benchmarked systems. (a) MD17 molecules: Aspirin, Ethanol, Naphthalene, and Salicylic acid. (b) 64 water molecules. (c) 512 water molecules. (d) Alanine dipeptide. (e) LiPS.
Figure 3: Head-to-head comparison of force MAE vs. Stability and $h(r)$ MAE on MD17 molecules. Models are on the x-axis and are sorted according to force error in descending order. High stability and low $h(r)$ MAE mean better performance. Error bars indicate 95% confidence intervals.
Figure 4: Comparison of force MAE vs. stability (Left), force MAE vs. RDF MAE (Middle), and force MAE vs. diffusivity MAE (Right) on the water benchmark. Each model is trained with three dataset sizes. The color of a point indicates the model identity, while the point size indicates the training dataset size (small: 1k, medium: 10k, large: 90k). Metrics that cannot be extracted for a given model/dataset-size combination (e.g., diffusivity for unstable models) are not included.
Figure 5: (a, b) For each row, the first three panels show Ramachandran plots of the alanine dipeptide FES reconstructed from a 5-ns reference simulation vs. 5-ns NequIP/GemNet-T simulations, all using metadynamics. The last two panels show $F(\phi)$ and $F(\psi)$ of alanine dipeptide extracted from the reference simulation vs. the NequIP simulation. The two rows result from different initializations, each marked with a yellow star. (c) ($\phi, \psi$) distribution of the alanine dipeptide training dataset. The six initialization points are marked with stars. NequIP fails to remain stable when the simulation starts from the point marked in black. (d) Model-predicted total energy as a function of simulation time when simulating the LiPS system using the NVE ensemble. (e) On water-10k, stability does not improve when the time step is reduced for GemNet-T.
...and 13 more figures
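The Figure 4 caption notes that diffusivity can only be extracted from simulations that stay stable long enough. As a rough illustration of why, the sketch below estimates a diffusion coefficient from the slope of the mean squared displacement (MSD) using the Einstein relation in three dimensions, $D = \lim_{t \to \infty} \mathrm{MSD}(t) / (6t)$. It assumes unwrapped coordinates, a single time origin, and a linear fit over the later part of the trajectory; the function name `diffusivity_from_msd` and the `fit_start_fraction` parameter are hypothetical, and the paper's own procedure may differ.

```python
import numpy as np


def diffusivity_from_msd(unwrapped_positions, dt, fit_start_fraction=0.2):
    """Estimate D from the MSD slope via the 3D Einstein relation.

    unwrapped_positions: array of shape (n_frames, n_atoms, 3), with
        periodic wrapping already removed (an assumption of this sketch).
    dt: time between consecutive frames.
    fit_start_fraction: fraction of the trajectory skipped before the
        linear fit, to avoid the short-time non-diffusive regime.
    """
    positions = np.asarray(unwrapped_positions, dtype=float)
    n_frames = positions.shape[0]

    # MSD relative to the first frame, averaged over atoms.
    displacement = positions - positions[0]
    msd = np.mean(np.sum(displacement ** 2, axis=-1), axis=1)

    time = np.arange(n_frames) * dt
    start = int(fit_start_fraction * n_frames)

    # Linear fit MSD(t) ~ slope * t + intercept; D = slope / 6 in 3D.
    slope, _intercept = np.polyfit(time[start:], msd[start:], 1)
    return slope / 6.0
```

A short or unstable trajectory leaves too few frames in the linear regime for the fit to be meaningful, which is consistent with diffusivity being omitted for unstable models.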
