Taming Multi-Domain, -Fidelity Data: Towards Foundation Models for Atomistic Scale Simulations

Tomoya Shiota, Kenji Ishihara, Tuan Minh Do, Toshio Mori, Wataru Mizukami

TL;DR

The paper tackles the challenge of building a universal atomistic model by introducing Total Energy Alignment (TEA), a two-step energy-harmonization protocol that unifies datasets computed at different quantum-chemical fidelities via Inner Core Energy Alignment and Atomization Energy Correction. This enables training a single, open-source MLIP, MACE-Osaka24, on a combined organic-inorganic dataset with competitive accuracy across molecular and crystalline systems. TEA demonstrates effective cross-domain alignment on QM9, molecular torsions, lattice constants, liquid water, and heterogeneous nanoparticle catalysts, illustrating the potential of foundation models in chemistry. By enabling data reuse without costly recalculations, TEA lowers the barrier to multi-domain model development and paves the way for more interoperable and scalable chemistry and materials modeling tools.

Abstract

Machine learning interatomic potentials (MLIPs) are changing atomistic simulations in the field of chemistry and materials science. However, constructing a single universal MLIP that can accurately model molecular and crystalline systems remains challenging. A central obstacle is the integration of diverse datasets generated under different computational conditions. We present Total Energy Alignment (TEA), an approach that enables the seamless integration of heterogeneous quantum chemical datasets without redundant calculations. Using TEA, we trained MACE-Osaka24, the first open-source MLIP model based on a unified dataset covering molecular and crystalline systems. This universal model displays strong performance across diverse chemical systems, exhibiting similar or improved accuracy in predicting organic reaction barriers compared to specialized models, while maintaining state-of-the-art accuracy for inorganic systems. These advancements pave the way for accelerated discoveries in the fields of chemistry and materials science via genuine foundation models for chemistry.

Paper Structure

This paper contains 28 sections, 8 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: (a) Scatter plot comparing the total energies of about 143 000 QM9 geometries obtained with Method 1 (PBE/PW via VASP; “QM9VASP”) and Method 2 ($\omega$B97M/def2-TZVPPD via Psi4; “QM9Psi4”). The very poor correlation ($\mathrm{R}^2 = -103.6$, root mean square error (RMSE) = 11 156 eV) underscores the large systematic difference between the two levels of theory; marginal histograms are shown on the axes. (b) Same data after the first stage of Total Energy Alignment (TEA), Inner Core Energy Alignment (ICEA), is applied to the total energies of the QM9Psi4 dataset. (c) Total energies after the second stage of TEA, Atomization Energy Correction (AEC), which brings the datasets into close agreement ($\mathrm{R}^2$ = 0.9965, RMSE = 0.839 eV). Insets in (b) and (c) enlarge the boxed regions. (d) Schematic potential-energy surfaces (PESs) for a representative molecule (benzene) calculated with Methods 1 (blue) and 2 (red), corresponding to one data point in (a). (e) Illustration of ICEA: for species with identical stoichiometry, ICEA acts as a constant vertical shift of the Method 2 PES. (f) Illustration of AEC: after ICEA, AEC rescales the shifted Method 2 PES by a factor $a$, yielding the fully aligned PES that matches the Method 1 reference.
  • Figure 2: (a) Optimized torsional potential energy surfaces (PESs) for a dihedral torsion in a representative organic molecule of the biaryl torsion dataset (Ref. lahey2020benchmarking), shown on the right-hand side of the figure. The results obtained using various machine learning interatomic potentials (MLIPs), including the SO3LR, MACE-MP-0, MACE-OFF23, and MACE-Osaka24 models, are compared with reference calculations performed using Psi4 ($\mathrm{\omega}$B97M-D3(BJ)), VASP (PBE), and ORCA (CCSD(T1)*). The CCSD(T1)* values were obtained from the biaryl torsion benchmark (Ref. lahey2020benchmarking). (b) Violin plot of the errors in the reaction energies, where the reaction energy is defined as the energy difference between the initial state (IS) and the final state (FS). The errors are calculated from single-point energy calculations of the 10 073 organic reactions of the Transition1x dataset conducted using the MACE-MP-0, MACE-OFF23, and MACE-Osaka24 models. The results were compared to the single-point energies calculated at the $\omega$B97M-D3(BJ) level using Psi4. The results obtained using the large and small models are shown in darker and lighter colors, respectively. (c) Violin plot of the errors in the energy barriers, where the energy barrier is defined as the energy difference between the IS and the transition state (TS). The results were compared to the single-point energies at the $\omega$B97M-D3(BJ) level of the 10 073 organic reactions of the Transition1x dataset, as calculated using Psi4 and the same models as those shown in (b). The darker and lighter colors represent the results obtained using the large and small models, respectively.
  • Figure 3: (a) Crystal structures and their representative materials used in the lattice constant benchmark shown in (b): face-centered cubic (FCC, e.g., Ag, Pd), body-centered cubic (BCC, e.g., Li, Na), halite (e.g., NaCl), zinc blende (e.g., GaAs), and diamond (e.g., C, Si). (b) Violin plot showing the errors in the lattice constants ($\mathrm{\AA}$) obtained using different models, including MACE-MP-0-small, MACE-MP-0-large, MACE-Osaka24-small, MACE-Osaka24-large, and M3GNet trained on the MPF.2021.2.8 dataset. The errors are calculated with respect to lattice constants optimized using VASP with the PBE functional, employing the MPRelaxSet input provided by Pymatgen from the Materials Project. (c) Relative energy (eV/atom) of a diamond-structured Si crystal as a function of the lattice constant ($\mathrm{\AA}$), as predicted using MACE models (MP-0 and Osaka24 variants) and compared to VASP calculations. The VASP calculations were performed using the MPStaticSet input provided by Pymatgen. (d) Radial distribution function (RDF, a.u.) of liquid water obtained via NVT simulations. The results obtained using the MACE-MP-0 and MACE-Osaka24 models with D3(BJ) corrections are shown, in addition to those obtained via classical MD simulations using the TIP3P and TIP4P/2005 water models.
  • Figure 4: (a) Structures of twenty equiatomic IrPdPtRhRu high-entropy alloy nanoparticles (HEA NPs) with 201 atoms obtained from PBE-level DFT geometry optimizations with VASP. (b) Schematic of CO adsorption at on-top sites of a HEA NP surface. The yellow hexagon highlights a target CO adsorption site. CO adsorption energies were obtained from Ref. shiota2025lowering for 17 of the 19 on-top sites that PBE-level DFT calculations identified as stable. The example shows a CO molecule adsorbed in an on-top configuration on a Ru corner atom. (c) Benchmarking of the NP systems with both the small and large variants of the MACE-MP-0 and MACE-Osaka24 models. The upper violin plot shows the distribution of root mean square deviations (RMSDs) for the twenty equiatomic IrPdPtRhRu HEA NP structures in (a), while the lower plot presents the error distribution of CO on-top adsorption energies relative to PBE-level DFT calculations provided in Ref. shiota2025lowering.
  • Figure 5: Results of the total energy alignment (TEA) of different datasets. (a) Parity plot of the atomization energies of the QM9VASP and QM9ADF datasets, as calculated using the same PBE functional. (b) Parity plot of the total energies after applying Inner Core Energy Alignment (ICEA) and Atomization Energy Correction (AEC) to the QM9ADF dataset.
  • ...and 4 more figures
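
The Figure 1 caption describes two operations: a constant, stoichiometry-dependent shift (ICEA) followed by a rescaling by a factor $a$ (AEC). It also reports a negative $\mathrm{R}^2$ before alignment. The sketch below illustrates both ideas on synthetic data only. It is not the authors' implementation. The per-element reference energies `mu_1` and `mu_2`, the single global scale factor, the least-squares fit for `a_fit`, and every numerical value are assumptions made for illustration. $\mathrm{R}^2$ is computed against the $y = x$ parity line, which is why a large systematic offset can drive it far below zero.

```python
import numpy as np


def parity_metrics(reference, predicted):
    """Return (R^2, RMSE) of `predicted` measured against the y = x parity line."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = reference - predicted
    rmse = float(np.sqrt(np.mean(residual**2)))
    ss_res = float(np.sum(residual**2))
    ss_tot = float(np.sum((reference - reference.mean()) ** 2))
    return 1.0 - ss_res / ss_tot, rmse


# --- Synthetic toy data (all values are hypothetical, in eV) ---
rng = np.random.default_rng(0)
n_mol = 200
# Rows: molecules; columns: number of atoms of three element types.
counts = rng.integers(1, 10, size=(n_mol, 3)).astype(float)

mu_1 = np.array([-1.0, -9.0, -6.0])          # assumed per-element reference, Method 1
mu_2 = np.array([-13.6, -1030.0, -2040.0])   # assumed per-element reference, Method 2
a_true = 0.97                                # assumed ratio of atomization energies

atomization = -(counts @ np.array([2.5, 6.0, 4.0])) + rng.normal(0.0, 0.5, n_mol)
e_1 = counts @ mu_1 + atomization
e_2 = counts @ mu_2 + atomization / a_true + rng.normal(0.0, 0.05, n_mol)

# Step 1 (ICEA-like): shift that depends only on stoichiometry, so it is a
# constant vertical offset for all geometries sharing the same composition.
e_2_shifted = e_2 + counts @ (mu_1 - mu_2)

# Step 2 (AEC-like): rescale the atomization part by one factor, fitted here
# by least squares through the origin.
at_1 = e_1 - counts @ mu_1
at_2 = e_2_shifted - counts @ mu_1
a_fit = float(at_2 @ at_1 / (at_2 @ at_2))
e_2_aligned = counts @ mu_1 + a_fit * at_2

r2_raw, rmse_raw = parity_metrics(e_1, e_2)
r2_shifted, rmse_shifted = parity_metrics(e_1, e_2_shifted)
r2_aligned, rmse_aligned = parity_metrics(e_1, e_2_aligned)

for label, r2, rmse in [
    ("raw", r2_raw, rmse_raw),
    ("after shift", r2_shifted, rmse_shifted),
    ("after shift + scale", r2_aligned, rmse_aligned),
]:
    print(f"{label:>20s}: R^2 = {r2:.4f}, RMSE = {rmse:.3f} eV")
```

In this toy setting the raw parity $\mathrm{R}^2$ is negative because the offset between the two methods is much larger than the spread of the reference energies. The shift removes most of the error and the scale factor removes the remainder, mirroring the qualitative progression from panel (a) to panel (c).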