Automated Machine Learning Pipeline: Large Language Models-Assisted Automated Dataset Generation for Training Machine-Learned Interatomic Potentials

Adam Lahouari, Jutta Rogal, Mark E. Tuckerman

TL;DR

AMLP addresses the challenge of building reliable MLIPs by automating dataset generation, code selection, and validation using a multi-agent LLM system within the MACE framework. The pipeline converts raw structure inputs into QM-ready workflows, generates AIMD-based training data, and trains and validates MLIPs with ASE-based analyses. In the acridine polymorph case, AMLP achieves sub-Å geometries and near-DFT accuracy for energies and forces, with robust energy conservation and meaningful dynamical validation across temperatures, while revealing limitations in transferability to unseen high-temperature forms. The work outlines a scalable path to automated MLIP development and points to future extensions to other architectures and active-learning orchestration.

Abstract

Machine learning interatomic potentials (MLIPs) have become powerful tools to extend molecular simulations beyond the limits of quantum methods, offering near-quantum accuracy at much lower computational cost. Yet developing reliable MLIPs remains difficult because it requires generating high-quality datasets, preprocessing atomic structures, and carefully training and validating models. In this work, we introduce an Automated Machine Learning Pipeline (AMLP) that unifies the entire workflow from dataset creation to model validation. AMLP employs large-language-model agents to assist with electronic-structure code selection, input preparation, and output conversion, while its analysis suite (AMLP-Analysis), based on ASE, supports a range of molecular simulations. The pipeline is built on the MACE architecture and validated on acridine polymorphs, where straightforward fine-tuning of a foundation model achieves mean absolute errors of ~1.7 meV/atom in energies and ~7.0 meV/Å in forces. The fitted MLIP reproduces DFT geometries with sub-Å accuracy and remains stable during molecular dynamics simulations in the microcanonical and canonical ensembles.
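The error metrics quoted in the abstract can be illustrated with a minimal sketch. The function names and the per-atom convention (total-energy error divided by the number of atoms in each structure) are assumptions made for illustration, not the AMLP implementation; inputs are assumed to be in eV and eV/Å.

```python
from typing import Sequence


def energy_mae_mev_per_atom(
    e_ref: Sequence[float], e_pred: Sequence[float], n_atoms: Sequence[int]
) -> float:
    """Mean absolute error of per-atom energies.

    Inputs are total energies in eV (one per structure); result is in meV/atom.
    Assumption: each structure's error is normalized by its own atom count.
    """
    errors = [abs(p - r) / n for r, p, n in zip(e_ref, e_pred, n_atoms)]
    return 1000.0 * sum(errors) / len(errors)


def force_mae_mev_per_angstrom(
    f_ref: Sequence[Sequence[Sequence[float]]],
    f_pred: Sequence[Sequence[Sequence[float]]],
) -> float:
    """Mean absolute error over all Cartesian force components.

    Inputs are nested as [structure][atom][x, y, z] in eV/Å; result is in meV/Å.
    """
    diffs = [
        abs(p - r)
        for struct_ref, struct_pred in zip(f_ref, f_pred)
        for atom_ref, atom_pred in zip(struct_ref, struct_pred)
        for r, p in zip(atom_ref, atom_pred)
    ]
    return 1000.0 * sum(diffs) / len(diffs)


# Toy inputs (not data from the paper): two structures with 2 and 4 atoms.
toy_energy_mae = energy_mae_mev_per_atom(
    [-10.0, -20.0], [-10.002, -19.996], [2, 4]
)
toy_force_mae = force_mae_mev_per_angstrom(
    [[[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]],
    [[[0.01, 0.0, 0.0], [0.1, 0.0, -0.02]]],
)
```

Other conventions exist, for example averaging force errors per atom rather than per component, so reported values are only comparable when the same definition is used.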

Paper Structure

This paper contains 13 sections, 5 equations, 20 figures, 10 tables, 1 algorithm.

Figures (20)

  • Figure 1: Roadmap of the Automated Machine Learning Pipeline (AMLP) framework. The workflow begins with structural inputs (.xyz, .cif), which are processed by AI-supervised agents to extract relevant literature data and recommend DFT parameters for different codes (e.g., CP2K, Gaussian, VASP). Automated input generation produces ready-to-use files for geometry optimization, cell optimization, single-point calculations, and AIMD. Simulation outputs are curated into .json datasets containing energies, forces, and structural information, which are then used to train MLIPs. Validation via ASE includes geometry and cell optimization, MD simulations, and RDF calculations. This roadmap illustrates the fully automated pipeline from raw input structures to validated MLIPs.
  • Figure 2: (a) $\Delta E$ denotes the difference in lattice energy between each polymorph and the most stable one (ACRDIN05). (b) Comparison between experimental (blue) and DFT-optimized (yellow) unit-cell volumes of various acridine polymorphs. The red line corresponds to $\Delta V$.
  • Figure 3: Distribution of the DFT-computed structures by (a) energy per atom and (b) maximum force.
  • Figure 4: (a) Mean absolute error (MAE) for the three committees used, for energies in meV/atom (blue) and forces in meV/Å (yellow).
  • Figure 5: Relative lattice energies ($\Delta E_{\mathrm{lattice}}$) of acridine polymorphs predicted by the MPA–MACE foundation model and by fine-tuned committee models (MACE-A/B/C), compared against fully relaxed DFT references. (a): single-point evaluations on fixed geometries; (b): energies after geometry optimization.
  • ...and 15 more figures
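
The relative lattice energies shown in Figures 2 and 5 are differences between each polymorph and the most stable one. A minimal sketch of that bookkeeping follows; the function name, the polymorph labels "A", "B", "C", and the numerical values are placeholders chosen for illustration and are not results from the paper.

```python
from typing import Dict


def relative_lattice_energies(lattice_energies: Dict[str, float]) -> Dict[str, float]:
    """Return each polymorph's lattice energy relative to the most stable one.

    The most stable polymorph is taken to be the one with the lowest
    (most negative) lattice energy, so it maps to 0.0 and all others
    map to non-negative values in the same units as the input.
    """
    e_min = min(lattice_energies.values())
    return {name: energy - e_min for name, energy in lattice_energies.items()}


# Placeholder labels and values (hypothetical, arbitrary energy units).
toy_lattice = {"A": -1.50, "B": -1.48, "C": -1.45}
toy_delta_e = relative_lattice_energies(toy_lattice)
most_stable = min(toy_delta_e, key=toy_delta_e.get)
```

Applying the same function to reference and model energies separately puts both on a common zero, so the stability ranking can be compared directly.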