Energy-GNoME: A Living Database of Selected Materials for Energy Applications

Paolo De Angelis, Giovanni Trezza, Giulio Barletta, Pietro Asinari, Eliodoro Chiavazzo

TL;DR

This work presents Energy-GNoME, an AI-driven, living database framework that mines the expansive GNoME material space for energy-related compounds by integrating a specialized energy subset $M^E$ with the general-purpose Materials Project (MP) database and the unexplored GNoME set $G$. A committee of AI-experts (classifiers) defines the energy-material region, while regressor ensembles predict key properties ($zT$, $E_g$, $\Delta V_c$) for screened candidates, enabling efficient, bias-aware screening and continuous database growth through active learning. Across three case studies (thermoelectrics, perovskites, and battery cathodes), the protocol yields thousands of promising candidates (e.g., 7,530 thermoelectrics, 4,259 perovskites, 21,243 cathodes) and demonstrates robust predictive performance ($R^2$ in the ~0.7 range for regressors; AUC ~ 0.98 for AI-experts). The approach addresses extrapolation biases, accelerates materials discovery, and lays the groundwork for expanding the Energy-GNoME space to include sustainability and toxicity considerations, making it a practical tool for experimental and computational exploration in energy materials.

Abstract

Artificial Intelligence (AI) in materials science is driving significant advancements in the discovery of advanced materials for energy applications. The recent GNoME protocol identifies over 380,000 novel stable crystals. From these, we identify over 33,000 materials with potential as energy materials, forming the Energy-GNoME database. Leveraging Machine Learning (ML) and Deep Learning (DL) tools, our protocol mitigates cross-domain data bias using feature spaces to identify potential candidates for thermoelectric materials, novel battery cathodes, and novel perovskites. Classifiers with both structural and compositional features identify domains of applicability, where we expect enhanced accuracy of the regressors. These regressors are trained to predict key material properties such as the thermoelectric figure of merit ($zT$), band gap ($E_g$), and cathode voltage ($\Delta V_c$). This method significantly narrows the pool of potential candidates, serving as an efficient guide for experimental and computational chemistry investigations and accelerating the discovery of materials suited for electricity generation, energy storage, and conversion.
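The two-stage logic described in the abstract (classifiers first, regressors second) can be illustrated with a minimal sketch. It assumes that each candidate already carries one membership probability per AI-expert and one property prediction per regressor, and that both committees are combined by a plain average; the function names, dictionary keys, and toy numbers are hypothetical and are not taken from the paper's code.

```python
from statistics import mean


def committee_probability(expert_probs):
    """Average the per-expert probabilities that a material is energy-like."""
    return mean(expert_probs)


def screen_and_predict(candidates, threshold=0.5):
    """Keep candidates whose mean expert probability exceeds the threshold,
    then attach the mean and spread of the regressor-ensemble predictions.

    Averaging the regressors is an assumption made for this sketch.
    """
    selected = []
    for cand in candidates:
        p = committee_probability(cand["expert_probs"])
        if p > threshold:
            preds = cand["regressor_preds"]
            selected.append(
                {
                    "id": cand["id"],
                    "P": p,
                    "prediction": mean(preds),
                    "spread": max(preds) - min(preds),
                }
            )
    # Most confident candidates first.
    return sorted(selected, key=lambda row: row["P"], reverse=True)


# Hypothetical toy candidates (not real GNoME entries).
candidates = [
    {"id": "y1", "expert_probs": [0.9, 0.8, 0.7], "regressor_preds": [0.4, 0.6]},
    {"id": "y2", "expert_probs": [0.2, 0.4, 0.3], "regressor_preds": [1.0, 1.2]},
    {"id": "y3", "expert_probs": [0.6, 0.5, 0.7], "regressor_preds": [0.2, 0.4]},
]
screened = screen_and_predict(candidates)
```

In this toy run, `y2` is discarded before any property is predicted for it, which is the point of the screening step: regressors are only applied where the classifiers suggest they are being used inside their domain of applicability.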

Paper Structure

This paper contains 17 sections, 14 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The schematic shows the protocol for creating the Energy-GNoME database, illustrating the training (grey dashed line) and predictive (black solid line) phases. Training begins with the cyan database and ends with ML model evaluations (e.g., parity plots, ROC curves). Feature extraction depends on how materials are stored in the "Energy" database $M^E$ and may use composition- or structure-based pipelines, as indicated by the OR switch symbol $\oplus$. The structure pipeline applies a graph representation, while the composition pipeline uses chemical descriptors, each feeding a committee of E(3)NN or GBDT models. Concurrently, the screening pipeline (orange box) trains GBDT classifiers ("AI-experts") to identify $M^E$-like materials. In prediction mode, GNoME materials ($y$) with over 50% likelihood $\left(P(y\in M) > 0.5\right)$ of matching the biases of the $M^E$ materials ($x$) pass the screening and enter the regressor pipeline to predict properties. These candidates, with predicted properties, are added to the Energy-GNoME database, initiating a continuous active learning cycle (see magenta arrow).
  • Figure 2: Hexagonal plot of thermoelectric candidate materials in the Energy-GNoME database. The hexagon colors represent material counts per region, as indicated by the color bar. Density distributions, $\rho$, are shown on the plot's top and right, calculated using Gaussian KDE for the average AI-expert probability, $P$, and predicted figure of merit, $zT$. The thermoelectric performance was assessed across six temperatures, with combined $zT$ values displayed in a color-coded distribution on the right. The crystal structures above show three notable candidates among the top-ranked screened thermoelectric materials, as determined by $R^{T}(y)$ (see Subsubsection \ref{sssec:m-case-specific-thermoelectrics}). Atom colors follow the extended CPK scheme [corey_molecular_1953] used by Jmol [noauthor_jmol].
  • Figure 3: Hexagonal plot of perovskite candidate materials in the Energy-GNoME database. The hexagon colors represent material counts per region on a logarithmic scale, as indicated by the color bar. Density distributions, $\rho$, are shown on the plot's top and right, calculated using Gaussian KDE for the average AI-expert probability, $P$, and the predicted band gap, $E_g$. For $E_g$, results are displayed from regressors trained on (a) perovskite data alone and (b) an augmented dataset. The crystal structures above show three notable candidates among the top-ranked perovskite materials, as determined by $R^{P}(y)$ (see Subsubsection \ref{sssec:m-case-specific-perovskite}). Atom colors follow the extended CPK scheme [corey_molecular_1953] used by Jmol [noauthor_jmol].
  • Figure 4: Candidates for battery cathode materials with various charge carriers: Li (a), Na (b), Mg (c), K (d), Ca (e), Cs (f), Al (g), Rb (h), and Y (i). Each candidate is represented as a point in the scatter plot, showing theoretical gravimetric capacity ($\mathrm{mAh\,g^{-1}}$, logarithmic scale) versus predicted average voltage difference ($\mathrm{V}$) relative to the pure element oxidation potential ($\mathrm{X/X^{n+}}$, with $\mathrm{X}$ being the working ion). Grey dashed hyperbolas indicate the predicted gravimetric energy ($\mathrm{Wh\,kg^{-1}}$), noted on the upper axis. Dot size represents the predicted maximum volume expansion, and the dots are color-coded to represent the predicted volumetric energy ($\mathrm{Wh\,L^{-1}}$). Due to dataset limitations, model performance varies (see Subsection \ref{ssec:resutls-batteries}). The legend in the top-right corner indicates prediction reliability: a green checkmark (✓) for models with $R^2$ and AUC above 0.5, showing higher accuracy, and a yellow warning (!) for models below this threshold.
  • Figure 5: Hexagonal plot of cathode candidate materials in the Energy-GNoME database for (a) Li-ion and (b) Na-ion batteries. The hexagon colors represent material counts per region, as indicated by the color bar. Density distributions, $\rho$, are shown on the plot's top and right, calculated using Gaussian KDE for the average AI-expert probability, $P$, and the average reduction potential, $\Delta V$. Three high-ranking screened cathode materials are shown as primitive crystal units above, identified using $R^{B}(y)$ (see Subsubsection \ref{sssec:m-case-specific-batteries}). Atom colors follow the extended CPK scheme [corey_molecular_1953] used by Jmol [noauthor_jmol].
  • ...and 1 more figures
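
The grey iso-energy hyperbolas in Figure 4 can be understood from a simple unit identity: a capacity in mAh/g multiplied by a voltage in V gives mWh/g, which is numerically equal to Wh/kg. The sketch below assumes that the gravimetric energy is the plain product of theoretical capacity and average voltage, which is the relation that yields hyperbolas in the capacity-voltage plane; the function names and the example numbers are hypothetical and are not values from the paper.

```python
def gravimetric_energy_wh_per_kg(capacity_mah_per_g, voltage_v):
    """Energy density as capacity times voltage.

    mAh/g * V = mWh/g, which is numerically the same as Wh/kg.
    """
    return capacity_mah_per_g * voltage_v


def voltage_on_isoenergy_curve(energy_wh_per_kg, capacity_mah_per_g):
    """Voltage needed at a given capacity to stay on one iso-energy hyperbola."""
    return energy_wh_per_kg / capacity_mah_per_g


# Hypothetical cathode: 150 mAh/g at an average voltage of 3.5 V.
energy = gravimetric_energy_wh_per_kg(150.0, 3.5)

# Doubling the capacity on the same hyperbola halves the required voltage.
voltage_at_double_capacity = voltage_on_isoenergy_curve(energy, 300.0)
```

Reading the plot this way, two candidates on the same dashed curve store the same energy per unit mass even if one achieves it through high capacity and the other through high voltage.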