aims-PAX: Parallel Active eXploration for the automated construction of Machine Learning Force Fields

Tobias Henkes, Shubham Sharma, Alexandre Tkatchenko, Mariana Rossi, Igor Poltavskyi

TL;DR

aims-PAX addresses the data inefficiency of ML force-field development by pairing automated initial data generation with parallel multi-trajectory active learning. It integrates GP-MLFFs, the MACE framework, and the FHI-aims DFT pipeline under a scalable Parsl-based workload manager, enabling rapid, transferable model construction with minimal human intervention. Across a flexible peptide, MD17 small molecules, solvated paracetamol, and CsPbI$_3$ perovskite benchmarks, aims-PAX delivers accuracy comparable to models trained on larger curated datasets while reducing the number of DFT reference calculations by up to three orders of magnitude and active-learning time by up to ten-fold. The framework thus provides a scalable, versatile platform for automated, data-efficient atomistic simulations applicable to both academic and industrial settings.

Abstract

Recent advances in machine learning force fields (MLFFs) have significantly extended the reach of atomistic simulations. Continuous progress in this field requires reliable reference datasets, accurate MLFF architectures, and efficient active learning strategies to enable robust modeling of complex molecular and material systems. Here we introduce aims-PAX, an expedited, multi-trajectory active learning framework that streamlines the development of stable and accurate MLFFs. Designed for a wide range of researchers, aims-PAX offers a modular, high-performance workflow that couples diversified sampling with scalable training across CPU and GPU architectures. Integrated with the widely used ab initio code FHI-aims, the framework supports state-of-the-art ML models and dataset generation using general-purpose (or "foundational") force fields for rapid deployment in diverse systems. We demonstrate the capabilities of aims-PAX in various challenging tasks: creating datasets and models for highly flexible peptides, for multiple organic molecules at once, and for explicitly solvated molecules, as well as efficiently handling computationally demanding systems such as the CsPbI$_3$ perovskite. We show that aims-PAX achieves a reduction of up to three orders of magnitude in the number of required reference calculations, automatically selects challenging systems within a given chemical space, and facilitates simulation of solvated molecules with more than a thousand atoms, while enabling a ten-fold speedup in active-learning time through optimized resource utilization. This positions aims-PAX as a powerful and versatile platform for next-generation atomistic simulations in both academic and industrial settings.

Paper Structure

This paper contains 10 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the aims-PAX workflow: (a) Required input files: The first file () follows FHI-aims conventions [Blum2009] and contains the DFT settings. It is also possible to use different DFT settings per trajectory. The system's geometry, or initial geometries, can either be inside a folder () or, in the case of a single geometry, in a file (). The file () contains MACE [Batatia2022mace, Batatia2025design] model hyperparameters and the fourth () is an aims-PAX-specific file containing the IDG and AL settings. For the AL workflow, folders containing the initial datasets () and models () are required. (b) Initial dataset generation (IDG): Geometries are sampled using either DFT or a GP model, with DFT providing labels in both cases. Sampling continues until a (* user specified) criterion is met. (c) Parallelized active learning: The AL workflow requires input files, existing data, and models, which can be provided by the IDG procedure. Sampling occurs over multiple trajectories, triggering DFT calculations when an uncertainty threshold is exceeded. GPU-based ML tasks (orange) and CPU-based DFT tasks (blue) can run in parallel. AL is continued until a (* user specified) stopping condition is met. (d) Output: Models and collected data produced during AL (and IDG).
  • Figure 2: aims-PAX applied to the peptide Ac-F-A5-K: (a) Model uncertainty, actual maximum force error, uncertainty threshold and training set size as a function of MD steps throughout the AL procedure. (b) Actual maximum force error vs. model uncertainty with Pearson correlation coefficient over the whole AL workflow. A linear fit is shown as a guide to the eye. (c) Pearson correlation coefficient and training set size over multiple segments of the AL workflow for $n=3$ trajectories that were used for sampling. (d) Ramachandran plot for selected dihedral angles (see f) acquired with a model used in the TEA challenge [tea_1, tea_2] (left) and ours, acquired using aims-PAX (right). Relative populations of highlighted clusters are given in bold font (black) and the blue number in the bottom left corner of each plot indicates the percentage of configurations from the MD trajectories assigned to a cluster [tea_1, tea_2]. (e) Number of geometries in the training set (left axis and bars) and number of required reference calculations for the dataset creation (right axis and bars) using a manual approach (as done in the TEA challenge [tea_1, tea_2], green) and aims-PAX (orange). (f) Structure of Ac-F-A5-K including highlighting of relevant dihedral angles A, B, and C.
  • Figure 3: Creation of a transferable MLFF via aims-PAX through astute sampling: Number of data points in the training sets of the MLFF acquired using aims-PAX. The data is split into points obtained through the initial dataset generation (yellow) and the active learning itself (green). The model trained from scratch through a manual data curation approach uses 100 points for each chemical species (black dashed line).
  • Figure 4: aims-PAX used for creating a model capable of modeling explicit solvation: (a) Vibrational density of states (VDoS) for paracetamol in the gas phase at 300 K acquired from the velocity autocorrelation function using the MLFF (solid black line) compared to the vibrational frequencies acquired within the harmonic approximation using said MLFF (tall blue vertical lines) and DFT (short deep orange lines). (b) Depiction of paracetamol with highlighted atoms that define the three dihedral angles analyzed in this work ($\tau_1$ in magenta, $\tau_2$ in orange, and $\tau_3$ in green) as well as the definition of $d_{OH}$ and marking of carbons three and five used in (c) of the same figure. (c) Newman projection [newman_proj] along $\tau_2$ of paracetamol for the cases $\tau_2=0^{\circ}$ and $\tau_2=39^{\circ}$ corresponding to the maxima in (e). (d) Snapshot of an MD trajectory of bulk water and the oxygen-oxygen radial distribution function obtained from simulations run by the MLFF acquired from aims-PAX and AIMD using PBE [water_rdf_pbe]. (e) Histogram and associated kernel density estimation of dihedral angles $\tau_1$, $\tau_2$, and $\tau_3$ from simulations of paracetamol in gas phase and explicit water. Simulations were run using the MLFF acquired through aims-PAX, and a snapshot of the simulation with solvent is depicted.
  • Figure 5: Speedup through parallelized active learning: Wall-clock runtime in hours as a function of the number of available CPU nodes for aims-PAX using Parsl applied to the perovskite CsPbI$_3$.
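
The uncertainty-triggered sampling described in the caption of Figure 1(c) can be illustrated with a minimal toy sketch. It is not the aims-PAX implementation: the one-dimensional "geometry", the random-walk stand-in for MD, the `reference_label` function standing in for a DFT call, the distance-based `uncertainty` proxy, and all parameter names and defaults are assumptions made for illustration only. The sketch also runs the trajectories one after another, whereas the framework runs ML and DFT tasks in parallel.

```python
import random


def reference_label(x):
    # Hypothetical stand-in for an expensive DFT label; a real workflow
    # would call the reference code (FHI-aims in the paper) here.
    return -4.0 * x * (x * x - 1.0)


def uncertainty(x, dataset):
    # Toy uncertainty proxy (an assumption, not the paper's estimator):
    # distance from x to the nearest already-labeled geometry.
    return min(abs(x - xi) for xi, _ in dataset)


def active_learning(initial_dataset, n_trajectories=3, n_steps=200,
                    threshold=0.1, max_labels=50, seed=0):
    """Sample several trajectories; label a point when uncertainty > threshold."""
    rng = random.Random(seed)
    dataset = list(initial_dataset)
    positions = [rng.uniform(-1.5, 1.5) for _ in range(n_trajectories)]
    n_labeled = 0
    for _ in range(n_steps):
        for i in range(n_trajectories):
            # Toy "MD" step: a clipped random walk in one dimension.
            positions[i] += rng.gauss(0.0, 0.05)
            positions[i] = max(-2.0, min(2.0, positions[i]))
            if uncertainty(positions[i], dataset) > threshold:
                # Threshold exceeded: request a reference label and extend
                # the dataset (a real loop would also retrain the model).
                dataset.append((positions[i], reference_label(positions[i])))
                n_labeled += 1
                # User-specified stopping condition: label budget reached.
                if n_labeled >= max_labels:
                    return dataset, n_labeled
    return dataset, n_labeled


if __name__ == "__main__":
    start = [(0.0, reference_label(0.0))]
    data, n_new = active_learning(start)
    print(f"labeled {n_new} new points; dataset size {len(data)}")
```

In this sketch the stopping condition is either the label budget `max_labels` or the step limit `n_steps`; both are placeholders for the user-specified criteria mentioned in the caption.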