AI-Driven Expansion and Application of the Alexandria Database

Théo Cavignac, Jonathan Schmidt, Pierre-Paul De Breuck, Antoine Loew, Tiago F. T. Cerqueira, Hai-Chen Wang, Anton Bochkarev, Yury Lysogorskiy, Aldo H. Romero, Ralf Drautz, Silvana Botti, Miguel A. L. Marques

TL;DR

This work substantially expands the Alexandria materials database and presents a fully open, multi-stage ML-driven workflow for discovering near-stable crystalline compounds. By chaining Matra-Genoa for structure generation, Orb-v2 for fast pre-relaxation, and ALIGNN for hull-distance prediction ahead of DFT, the authors achieve a 99% success rate in identifying compounds within 100 meV/atom of stability. The workflow generated 119 million candidates and added 1.3 million DFT-validated entries, 74 thousand of them on the convex hull. The database grows to 5.8 million DFT structures and is analysed in terms of space groups, coordination environments, and phase-network connectivity. The release also includes sAlex25, a new subset for training universal machine learning interatomic potentials (uMLIPs), and a GRACE model fine-tuned on it that improves benchmark accuracy. All data, models, and workflows are openly released, supporting reproducibility and community-driven work in data-driven materials discovery.
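The staged screening described above can be sketched as a simple funnel. This is a minimal illustration, not the authors' implementation: the function `screen_candidates`, its callables `pre_relax` and `predict_e_hull`, and the default cutoff `max_e_hull=0.1` eV/atom are all hypothetical stand-ins. In the workflow they would correspond to the Orb-v2 pre-relaxation and the ALIGNN hull-distance prediction, and the source does not state which cutoff was actually used to select candidates for DFT.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

S = TypeVar("S")  # any structure representation


def screen_candidates(
    candidates: Iterable[S],
    pre_relax: Callable[[S], S],
    predict_e_hull: Callable[[S], float],
    max_e_hull: float = 0.1,  # eV/atom; assumed cutoff, not taken from the paper
) -> List[Tuple[S, float]]:
    """Return (relaxed structure, predicted hull distance) pairs worth sending to DFT."""
    shortlist = []
    for structure in candidates:
        relaxed = pre_relax(structure)    # cheap geometry relaxation
        e_hull = predict_e_hull(relaxed)  # predicted distance to the convex hull
        if e_hull <= max_e_hull:
            shortlist.append((relaxed, e_hull))
    # Most promising candidates (smallest predicted hull distance) first.
    shortlist.sort(key=lambda item: item[1])
    return shortlist


# Toy stand-ins: labels instead of structures, a lookup table instead of a model.
toy_predictions = {"cand_a": 0.02, "cand_b": 0.35, "cand_c": 0.09}
shortlist = screen_candidates(
    toy_predictions,
    pre_relax=lambda s: s,
    predict_e_hull=toy_predictions.__getitem__,
)
```

Only the shortlisted candidates would then go on to the expensive DFT step. The point of the ordering is that each stage is cheaper than the one after it.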

Abstract

We present a novel multi-stage workflow for computational materials discovery that achieves a 99% success rate in identifying compounds within 100 meV/atom of thermodynamic stability, with a threefold improvement over previous approaches. By combining the Matra-Genoa generative model, Orb-v2 universal machine learning interatomic potential, and ALIGNN graph neural network for energy prediction, we generated 119 million candidate structures and added 1.3 million DFT-validated compounds to the ALEXANDRIA database, including 74 thousand new stable materials. The expanded ALEXANDRIA database now contains 5.8 million structures with 175 thousand compounds on the convex hull. Predicted structural disorder rates (37-43%) match experimental databases, unlike other recent AI-generated datasets. Analysis reveals fundamental patterns in space group distributions, coordination environments, and phase stability networks, including sub-linear scaling of convex hull connectivity. We release the complete dataset, including sAlex25 with 14 million out-of-equilibrium structures containing forces and stresses for training universal force fields. We demonstrate that fine-tuning a GRACE model on this data improves benchmark accuracy. All data, models, and workflows are freely available under Creative Commons licenses.
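The stability criterion used throughout is the distance to the convex hull of formation energies. As a rough illustration, the sketch below computes it for a hypothetical binary system A(1-x)B(x), where the hull reduces to a lower convex envelope in the (x, energy) plane. The function names and the toy energies are assumptions made for this example. The database itself spans multi-component systems, which need a general N-dimensional hull.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (fraction of B, formation energy in eV/atom)


def lower_hull(points: List[Point]) -> List[Point]:
    """Lower convex envelope of (x, energy) points via a monotone-chain scan."""
    # Keep only the lowest-energy entry at each composition.
    best = {}
    for x, e in points:
        if x not in best or e < best[x]:
            best[x] = e
    hull: List[Point] = []
    for p in sorted(best.items()):
        # Pop the last vertex while it lies on or above the chord to p.
        while len(hull) >= 2:
            (x0, e0), (x1, e1) = hull[-2], hull[-1]
            cross = (x1 - x0) * (p[1] - e0) - (e1 - e0) * (p[0] - x0)
            if cross > 0:
                break
            hull.pop()
        hull.append(p)
    return hull


def energy_above_hull(x: float, energy: float, hull: List[Point]) -> float:
    """Vertical distance (eV/atom) from a point to the hull at composition x."""
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return energy - (e0 + t * (e1 - e0))
    raise ValueError("composition lies outside the range covered by the hull")


# Toy data: elemental references at zero plus three hypothetical compounds.
entries = [(0.0, 0.0), (0.25, -0.1), (0.5, -0.4), (0.75, -0.35), (1.0, 0.0)]
hull = lower_hull(entries)
distance = energy_above_hull(0.25, -0.1, hull)
```

In this toy system the compound at x = 0.25 sits about 0.1 eV/atom (100 meV/atom) above the line joining A and AB. That puts it right at the threshold quoted in the abstract, while the entries at x = 0.5 and x = 0.75 lie on the hull.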

Paper Structure

This paper contains 16 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Materials discovery workflow used in Alexandria
  • Figure 2: Histograms of the distances to the convex hull of the different datasets described in the text. The histograms are normalized for easier comparison.
  • Figure 3: Evolution of the number of structures introduced to the database through time, in relation to their distance to the hull.
  • Figure 4: Elemental distribution within the database. Each cell of the periodic table indicates the fraction of materials containing a given element, relative to all materials on the convex hull (upper left) and to the entire database (lower right).
  • Figure 5: Space group distribution of structures on the convex hull.
  • ...and 3 more figures