MLIP Arena: Advancing Fairness and Transparency in Machine Learning Interatomic Potentials via an Open, Accessible Benchmark Platform
Yuan Chiang, Tobias Kreiman, Christine Zhang, Matthew C. Kuner, Elizabeth Weaver, Ishan Amin, Hyunsoo Park, Yunsung Lim, Jihan Kim, Daryl Chrzan, Aron Walsh, Samuel M. Blau, Mark Asta, Aditi S. Krishnapriyan
TL;DR
MLIP Arena addresses the limitations of error-driven benchmarks by introducing open, physics-aware benchmarking for machine learning interatomic potentials. It defines four task families (off-equilibrium asymptotics, MD stability and reactivity, robustness to distribution shifts, and thermodynamic phenomena) and evaluates models with metrics that emphasize physical consistency, energy conservation, and symmetry. The study reveals nuanced failure modes: models that rank highest on bulk benchmarks may underperform on pair interactions, non-conservative force predictions can drift under distribution shifts, and dynamical properties such as vacancy migration and MOF adsorption expose weaknesses that standard MAE metrics do not capture. By providing reproducible workflows, an online leaderboard, and diverse case studies, MLIP Arena offers a principled path toward MLIPs that are accurate, efficient, and physically reliable for real-world atomistic modeling.
Abstract
Machine learning interatomic potentials (MLIPs) have revolutionized molecular and materials modeling, but existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific density functional theory (DFT) references. We introduce MLIP Arena, a benchmark platform that evaluates force field performance based on physics awareness, chemical reactivity, stability under extreme conditions, and predictive capabilities for thermodynamic properties and physical phenomena. By moving beyond static DFT references and revealing important failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide next-generation MLIP development toward improved predictive accuracy and runtime efficiency while maintaining physical consistency. The Python package and online leaderboard are available at https://github.com/atomind-ai/mlip-arena.
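
To make the notion of physical consistency concrete, one check in this spirit is whether a model's forces are conservative, that is, whether they equal the negative gradient of its predicted energy. The sketch below is illustrative only and is not taken from the MLIP Arena package: it uses a toy Lennard-Jones cluster as a stand-in for an MLIP, and every function name in it (lj_energy, lj_forces, finite_difference_forces, max_force_mismatch) is hypothetical. It compares the reported forces against central finite differences of the energy.

```python
import numpy as np


def lj_energy(positions, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy of a cluster (toy stand-in for an MLIP)."""
    energy = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            sr6 = (sigma / r) ** 6
            energy += 4.0 * epsilon * (sr6 ** 2 - sr6)
    return energy


def lj_forces(positions, epsilon=1.0, sigma=1.0):
    """Analytic forces F = -dE/dR for the same toy potential."""
    forces = np.zeros_like(positions)
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            rij = positions[i] - positions[j]
            r = np.linalg.norm(rij)
            sr6 = (sigma / r) ** 6
            # Radial derivative of the pair energy, dE/dr.
            dE_dr = 4.0 * epsilon * (-12.0 * sr6 ** 2 + 6.0 * sr6) / r
            f_on_i = -dE_dr * rij / r
            forces[i] += f_on_i
            forces[j] -= f_on_i
    return forces


def finite_difference_forces(energy_fn, positions, h=1e-5):
    """Central finite-difference estimate of -dE/dR, one coordinate at a time."""
    forces = np.zeros_like(positions)
    for idx in np.ndindex(positions.shape):
        plus = positions.copy()
        minus = positions.copy()
        plus[idx] += h
        minus[idx] -= h
        forces[idx] = -(energy_fn(plus) - energy_fn(minus)) / (2.0 * h)
    return forces


def max_force_mismatch(energy_fn, force_fn, positions, h=1e-5):
    """Largest absolute gap between reported forces and -dE/dR."""
    reference = finite_difference_forces(energy_fn, positions, h)
    return float(np.max(np.abs(force_fn(positions) - reference)))


# Arbitrary three-atom geometry in reduced units (an assumption for the demo).
positions = np.array([[0.0, 0.0, 0.0],
                      [1.1, 0.0, 0.0],
                      [0.3, 1.2, 0.2]], dtype=float)

# Forces derived from the energy should agree with finite differences.
conservative_gap = max_force_mismatch(lj_energy, lj_forces, positions)

# A deliberately inconsistent force model: a constant offset that no energy produces.
biased_forces = lambda p: lj_forces(p) + 0.01
non_conservative_gap = max_force_mismatch(lj_energy, biased_forces, positions)
```

Here conservative_gap should be limited by finite-difference error, while non_conservative_gap should be close to the injected offset of 0.01. For a real MLIP the same comparison would use the model's own energy and force calls in place of the toy functions, with the step size h chosen to suit the model's numerical precision.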
