Metatensor and metatomic: foundational libraries for interoperable atomistic machine learning

Filippo Bigi, Joseph W. Abbott, Philip Loche, Arslan Mazitov, Davide Tisi, Marcel F. Langer, Alexander Goscinski, Paolo Pegolo, Sanggyu Chong, Rohit Goswami, Pol Febrer, Sofiia Chorna, Matthias Kellner, Michele Ceriotti, Guillaume Fraux

TL;DR

The paper tackles interoperability barriers in atomistic ML by introducing metatensor, a gradient-aware, block-sparse data format, and metatomic, a portable ML-model interface. It defines a robust data container and a universal model-exchange protocol to enable seamless data/model sharing across diverse simulation engines. A modular ecosystem (metatrain, featomic, torch-spex, torch-pme, vesin, sphericart) and example models (PET-MAD, ShiftML, FlashMD) demonstrate end-to-end workflows from training to deployment in LAMMPS, ASE, i-PI, PLUMED, and beyond. The results show minimal runtime overhead for metatomic in production-like runs and reveal broad applicability across short- and long-range interactions, collective-variable workflows, and quantum-sampled simulations.

Abstract

Incorporation of machine learning (ML) techniques into atomic-scale modeling has proven to be an extremely effective strategy to improve the accuracy and reduce the computational cost of simulations. It also entails conceptual and practical challenges, as it involves combining very different mathematical foundations, as well as software ecosystems that are very well developed in their own right, but do not share many commonalities. To address these issues and facilitate the adoption of ML in atomistic simulations, we introduce two dedicated software libraries. The first one, metatensor, provides multi-platform and multi-language storage and manipulation of arrays with many potentially sparse indices, designed from the ground up for atomistic ML applications. By combining the actual values with metadata that describes their nature and that facilitates the handling of geometric information and gradients with respect to the atomic positions, metatensor provides a common framework to enable data sharing between ML software -- typically written in Python -- and established atomistic modeling tools -- typically written in Fortran, C or C++. The second library, metatomic, provides an interface to store an atomistic ML model and metadata about this model in a portable way, facilitating the implementation, training and distribution of models, and their use across different simulation packages. We showcase a growing ecosystem of tools, including low-level libraries, training utilities, and interfaces with existing software packages that demonstrate the effectiveness of metatensor and metatomic in bridging the gap between traditional simulation software and modern ML frameworks.

Paper Structure

This paper contains 30 sections, 17 figures, 1 table.

Figures (17)

  • Figure 1: Schematic illustration of the three primary objects in metatensor. (1) A TensorMap is a key/value map that acts as a block-sparse storage format, and forms the highest-level object. It groups multiple TensorBlocks and (optionally) their associated gradient TensorBlocks (in curly brackets) into a complete representation, each indexed by an entry in the keys Labels. (2) Each TensorBlock consists of values -- i.e. dense floating-point data -- and the corresponding metadata in the samples, components, and properties Labels. (3) Labels are used to store metadata, with named dimensions and unique row entries. Shown are one set of Labels corresponding to the keys of the TensorMap (with dimensions "key_1" and "key_2"), and three sets of Labels corresponding to the samples ("system" and "atom" dimensions), components (a single "mu" dimension), and properties ("type_1", "n_1", and "type_2" dimensions) of a TensorBlock.
  • Figure 2: Examples of data stored in TensorMap objects. a) The atomization energy, a scalar, is stored in a single TensorBlock. b) The gradient of the energy with respect to the atomic positions, a per-atom Cartesian tensor, can be stored in a gradient TensorBlock associated with the energy block from a). c) The molecular dipole moment, a per-system Cartesian tensor, is also stored in a single block. d) The atom-centered basis decomposition of the electron density, a spherical tensor on an atomic basis, can be stored using the TensorMap block-sparse format with each block corresponding to irreducible representations of the O(3) group for each atom type.
  • Figure 3: An example of a series of transformations using metatensor-operations and metatensor-learn, for the task of transforming an equivariant descriptor into a predicted molecular dipole moment. (1) An equivariant descriptor is computed for some input systems (or "frames"), for example a $\lambda$-SOAP using the featomic.EquivariantPowerSpectrum calculator, see Section \ref{sec:featomic}. The descriptor is separated into blocks matching the different O(3) symmetries of the target property (lambda=1 and sigma=1) and the central atom type. (2) keys_to_samples is used to group together blocks with the same symmetries, making the descriptor dense along the atom type. (3) A neural network created using metatensor-learn utilities transforms the features of the descriptor into a per-atom dipole prediction. (4) sum_over_samples is used to aggregate local predictions to form the per-structure total dipole prediction. (5) The prediction is transformed from the spherical to the Cartesian basis.
  • Figure 4: Minimal pseudo-code example showing how to wrap a pretrained PyTorch model and export it as a metatomic model. The wrapped model requests a neighbor list that will be provided by the simulation engine.
  • Figure 5: Schematic illustration of the information flow between the ML model and simulation engine in metatomic. The model contains the serialized code, the weights of the ML model, metadata about the model itself (authors, article references) and metadata about what the model can do. A simulation engine will query the model about what inputs it requires (including any required neighbor lists) and which outputs it can produce. The engine can then prepare the inputs and run the model to get the output using a unified interface.
  • ...and 12 more figures
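
The layout described in the Figure 1 caption, and the sample aggregation in step (4) of Figure 3, can be illustrated with a small toy model. The sketch below uses only plain Python dataclasses and is not the metatensor API: the class and function names are borrowed from the captions for readability, the components axis is omitted, and the "center_type" key and the energy values are made-up assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Labels:
    """Toy metadata table: named dimensions plus unique integer rows."""
    names: tuple
    values: tuple  # one tuple of ints per row

    def __post_init__(self):
        if any(len(row) != len(self.names) for row in self.values):
            raise ValueError("every row needs one entry per dimension")
        if len(set(self.values)) != len(self.values):
            raise ValueError("rows must be unique")

    def position(self, row):
        """Index of `row` in the table; raises ValueError if absent."""
        return self.values.index(tuple(row))


@dataclass
class TensorBlock:
    """Dense values[sample][property] plus the Labels describing each axis."""
    values: list
    samples: Labels
    properties: Labels
    gradients: dict = field(default_factory=dict)  # e.g. {"positions": TensorBlock}


@dataclass
class TensorMap:
    """Block-sparse container: one TensorBlock per row of `keys`."""
    keys: Labels
    blocks: list

    def block(self, **selection):
        """Look up the block whose key row matches the named selection."""
        row = tuple(selection[name] for name in self.keys.names)
        return self.blocks[self.keys.position(row)]


def sum_over_samples(block, sample_name):
    """Sum rows that agree on every sample dimension except `sample_name`."""
    keep = [i for i, name in enumerate(block.samples.names) if name != sample_name]
    totals = {}  # reduced sample row -> accumulated property values
    for row, row_values in zip(block.samples.values, block.values):
        reduced = tuple(row[i] for i in keep)
        acc = totals.setdefault(reduced, [0.0] * len(row_values))
        for j, v in enumerate(row_values):
            acc[j] += v
    new_samples = Labels(tuple(block.samples.names[i] for i in keep), tuple(totals))
    return TensorBlock(list(totals.values()), new_samples, block.properties)


# Made-up per-atom energies for two systems, one block per atom-type key.
energy = Labels(("energy",), ((0,),))
block_h = TensorBlock(
    values=[[1.5], [2.5], [4.0]],
    samples=Labels(("system", "atom"), ((0, 0), (0, 1), (1, 0))),
    properties=energy,
)
block_o = TensorBlock(
    values=[[-3.0]],
    samples=Labels(("system", "atom"), ((0, 2),)),
    properties=energy,
)
tmap = TensorMap(keys=Labels(("center_type",), ((1,), (8,))), blocks=[block_h, block_o])

# Aggregate per-atom values into one row per system for the first block.
summed = sum_over_samples(tmap.block(center_type=1), "atom")
print(summed.samples.names, summed.samples.values, summed.values)
```

In this toy version the metadata travels with the data: after the reduction, the samples Labels of the result keep only the "system" dimension, so downstream code can tell that each row is now a per-system quantity without relying on array positions.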