MLIP Arena: Advancing Fairness and Transparency in Machine Learning Interatomic Potentials via an Open, Accessible Benchmark Platform

Yuan Chiang, Tobias Kreiman, Christine Zhang, Matthew C. Kuner, Elizabeth Weaver, Ishan Amin, Hyunsoo Park, Yunsung Lim, Jihan Kim, Daryl Chrzan, Aron Walsh, Samuel M. Blau, Mark Asta, Aditi S. Krishnapriyan

TL;DR

MLIP Arena tackles the limitations of error-driven benchmarks by introducing physics-aware, open benchmarking for machine-learning interatomic potentials. It defines four task families (off-equilibrium asymptotics, MD stability and reactivity, robustness to distribution shifts, and thermodynamic phenomena) and evaluates models with metrics that emphasize physical consistency, energy conservation, and symmetry. The study reveals nuanced failure modes: models that rank highest on bulk properties may underperform on pair interactions, non-conservative force predictions can drift under distribution shifts, and dynamical properties like vacancy migration and MOF adsorption expose weaknesses not captured by standard MAE metrics. By providing reproducible workflows, an online leaderboard, and diverse case studies, MLIP Arena offers a principled path to develop MLIPs that are accurate, efficient, and physically reliable for real-world atomistic modeling.

Abstract

Machine learning interatomic potentials (MLIPs) have revolutionized molecular and materials modeling, but existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific density functional theory (DFT) references. We introduce MLIP Arena, a benchmark platform that evaluates force field performance based on physics awareness, chemical reactivity, stability under extreme conditions, and predictive capabilities for thermodynamic properties and physical phenomena. By moving beyond static DFT references and revealing the important failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide the next-generation MLIP development toward improved predictive accuracy and runtime efficiency while maintaining physical consistency. The Python package and online leaderboard are available at https://github.com/atomind-ai/mlip-arena.

Paper Structure

This paper contains 51 sections, 19 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Overview of MLIP Arena. Four benchmark categories beyond error-based regression metrics provide actionable insights that are agnostic to the underlying model architecture and DFT reference. Tasks are defined as Prefect (https://www.prefect.io/) workflows to enable advanced task caching, chaining, and parallel/concurrent execution on HPC. The Atomic Simulation Environment (ASE) larsen2017atomic calculator and database are used. The codebase (https://github.com/atomind-ai/mlip-arena) and an online leaderboard hosted as a Hugging Face Space (https://huggingface.co/spaces/atomind/mlip-arena) are available.
  • Figure 2: EOS benchmark on 1,000 WBM structures wang_predicting_2021. The reduced relative energy, $\frac{\Delta E}{BV_0}$, is normalized by the bulk modulus $B$ and equilibrium volume $V_0$ through a rearrangement of the Birch–Murnaghan EOS (Eq. \ref{eq:bm-eos-rg}). Color indicates the EOS curve of each crystal structure. The number of valid predictions for each model is shown after the model name.
  • Figure 3: MD stability on RM24 structures. For NVT (\ref{fig:stability-nvt}), we run MD with a Nosé-Hoover thermostat while the temperature increases linearly from 300 K to 3000 K. The number of valid trajectories and the scaling of MD steps per second (SPS) with the number of atoms $N$ are shown. For NPT (\ref{fig:stability-npt}), a Nosé-Hoover thermostat is used with an additional pressure ramp from 0 GPa to 500 GPa. The size of each point represents the number of valid steps along each valid trajectory. The power law $\text{SPS} = a N^b$ is fitted to determine the asymptotic performance of each MLIP (solid line). The first 120 structures from RM24 are used for NVT, and the first 80 structures are used for NPT. The target length of each trajectory is 10 ps. The cuEquivariance kernel was disabled for MACE-family models.
  • Figure 4: Energy conservation under distribution shift. Energy deviation is calculated for each sliding window during 5 ps NVE MD simulations. For each window, the differential entropy of the structure in the middle of the window is calculated, and the energy deviation from the start to the end of the window is recorded. We report 95% confidence interval error bars and a line of best fit. The order in which windows appear during the simulation is annotated by the number on each point. For direct force prediction models, the simulated trajectories become increasingly surprising over time, as shown by the monotonically increasing numbers from left to right.
  • Figure 5: NEB profiles of vacancy migration in FCC (a) and HCP (b) elemental crystals. All path lengths are normalized to 1, and all energies are normalized by the PBE vacancy migration energy barrier $E_\text{vm}^\text{PBE}$ given in angsten2014elemental. The number of missing predictions, the average path asymmetry, and the MAPE of the maximum energy barrier are annotated at the top left.
  • ...and 12 more figures
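
The Figure 3 caption describes a power law $\text{SPS} = a N^b$ used to characterize how simulation throughput scales with system size. A minimal sketch of one way to obtain $a$ and $b$ is shown below. It assumes an ordinary least-squares fit in log-log space, which the caption does not specify, and the function name `fit_power_law` and the data are hypothetical, synthetic placeholders rather than results from the paper.

```python
import numpy as np


def fit_power_law(n_atoms, sps):
    """Fit SPS = a * N**b via linear least squares on log-transformed data.

    Assumption: log-log least squares; the paper's actual fitting
    procedure is not stated in the caption.
    """
    log_n = np.log(np.asarray(n_atoms, dtype=float))
    log_sps = np.log(np.asarray(sps, dtype=float))
    # log(SPS) = b * log(N) + log(a), so slope -> b and intercept -> log(a)
    b, log_a = np.polyfit(log_n, log_sps, 1)
    return float(np.exp(log_a)), float(b)


# Synthetic, illustrative data generated from a known power law
# (NOT measurements from MLIP Arena).
n_atoms = np.array([10.0, 20.0, 50.0, 100.0, 200.0])
sps = 500.0 * n_atoms ** -0.8

a, b = fit_power_law(n_atoms, sps)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Under this reading, an exponent $b$ close to $-1$ would indicate that the cost per MD step grows roughly linearly with the number of atoms, while a value closer to $0$ would suggest the run is dominated by fixed overhead at the sizes sampled.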