Why Physics Still Matters: Improving Machine Learning Prediction of Material Properties with Phonon-Informed Datasets

Pol Benítez, Cibrán López, Edgardo Saucedo, Teruyasu Mizoguchi, Claudio Cazorla

TL;DR

This work addresses how training data design affects ML predictions of finite-temperature material properties. It compares random-displacement and phonon-informed sampling schemes for anti-perovskites, training graph neural networks (GNNs) to predict energy per atom, band gap, valence band maximum, and hydrostatic stress, with $4{,}500$ configurations across Ag-based compounds. The phonon-informed approach yields higher accuracy and better generalization than random sampling, achieving $R^2 \approx 0.85$ and MAE $\approx 0.030$ eV for band gaps using ~1,000 structures, while highlighting that Ag–S bonds govern band-gap variation under thermal conditions. The study demonstrates that physics-guided data generation can outperform larger random datasets and offers a practical, interpretable strategy for efficient ML-driven materials discovery at finite temperature.
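
For reference, the two metrics quoted in this summary are assumed here to follow their standard definitions; the summary itself does not state which variant was used. The symbols are introduced only for this note: $y_i$ is the DFT reference value for configuration $i$, $\hat{y}_i$ the GNN prediction, $\bar{y}$ the mean of the reference values, and $N$ the number of configurations evaluated.

$$
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}
$$

Under these definitions, MAE carries the units of the predicted property (eV for band gaps), while $R^2$ is dimensionless and equals 1 for a perfect fit.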

Abstract

Machine learning (ML) methods have become powerful tools for predicting material properties with near first-principles accuracy and vastly reduced computational cost. However, the performance of ML models critically depends on the quality, size, and diversity of the training dataset. In materials science, this dependence is particularly important for learning from low-symmetry atomistic configurations that capture thermal excitations, structural defects, and chemical disorder, features that are ubiquitous in real materials but underrepresented in most datasets. The absence of systematic strategies for generating representative training data may therefore limit the predictive power of ML models in technologically critical fields such as energy conversion and photonics. In this work, we assess the effectiveness of graph neural network (GNN) models trained on two fundamentally different types of datasets: one composed of randomly generated atomic configurations and another constructed using physically informed sampling based on lattice vibrations. As a case study, we address the challenging task of predicting electronic and mechanical properties of a prototypical family of optoelectronic materials under realistic finite-temperature conditions. We find that the phonon-informed model consistently outperforms the randomly trained counterpart, despite relying on fewer data points. Explainability analyses further reveal that high-performing models assign greater weight to chemically meaningful bonds that control property variations, underscoring the importance of physically guided data generation. Overall, this work demonstrates that larger datasets do not necessarily yield better GNN predictive models and introduces a simple and general strategy for efficiently constructing high-quality training data in materials informatics.
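
The abstract does not spell out how either dataset is generated, so the following is only a minimal Python sketch of the contrast between the two ideas, not the authors' procedure. It assumes uniform random Cartesian displacements for the random scheme and, for the phonon-informed scheme, classical harmonic sampling in which each mass-weighted normal-mode amplitude is drawn from a Gaussian of variance $k_B T/\omega^2$. The function names, the reduced units, and the two toy modes are hypothetical and chosen purely for illustration.

```python
import math
import random


def random_displacements(n_atoms, max_amp, rng):
    """Random scheme (assumed form): every Cartesian component is drawn
    uniformly from [-max_amp, max_amp], ignoring the lattice dynamics."""
    return [[rng.uniform(-max_amp, max_amp) for _ in range(3)]
            for _ in range(n_atoms)]


def phonon_displacements(masses, freqs, eigvecs, kT, rng):
    """Phonon-informed scheme (assumed classical harmonic form).

    Each mode k gets a mass-weighted amplitude Q_k ~ N(0, kT / w_k^2);
    atom i is displaced by sum_k Q_k * e_k[i] / sqrt(m_i).
    Reduced units are assumed: kT, masses and freqs must be consistent.
    """
    n_atoms = len(masses)
    disp = [[0.0, 0.0, 0.0] for _ in range(n_atoms)]
    for w, mode in zip(freqs, eigvecs):
        q = rng.gauss(0.0, math.sqrt(kT) / w)  # soft modes get larger amplitudes
        for i in range(n_atoms):
            for c in range(3):
                disp[i][c] += q * mode[i][c] / math.sqrt(masses[i])
    return disp


# Hypothetical toy input: two atoms, two orthonormal modes along x.
toy_modes = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # soft mode, moves atom 0
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # stiff mode, moves atom 1
]
rng = random.Random(0)
sample = phonon_displacements([1.0, 1.0], [1.0, 2.0], toy_modes, 1.0, rng)
print(sample)
```

In this sketch the phonon-informed displacements concentrate along low-frequency (soft) modes, whereas the random scheme treats all directions alike regardless of how much energy they cost.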

Paper Structure

This paper contains 8 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Physical and ML aspects in materials informatics covered in this study. Atomic structure of the reference anti-perovskite system with ions located (a) at the equilibrium lattice sites and (b) around the equilibrium lattice sites. (c) Mapping of a real-space atomic configuration into its graph representation. Crystal periodicity is indicated by dashed lines in the physical representation and by self-loops and multiple edges in the graph representation. (d) Property prediction using GNNs. (e) The selected quantities for ML prediction: energy per atom, band gap, valence band maximum, and hydrostatic stress.
  • Figure 2: Performance of the best GNN models. DFT values versus GNN model predictions for the training, validation, and test datasets. Normalized distributions of the errors in the ML predictions are shown in (d), (h), (l), and (p). Results are shown for (a–d) energy per atom, (e–h) band gap, (i–l) valence band maximum, and (m–p) hydrostatic stress.
  • Figure 3: Atomically perturbed anti-perovskite structures and configuration space. Reference anti-perovskite structure perturbed following (a) random- and (b) phonon-based atomic displacement schemes. Wavy arrows represent phonon modes, where the color indicates the displaced atom. (c) Configurational space of possible structures with fixed lattice parameters, where each point corresponds to a specific atomic configuration. The gray region indicates thermally accessible states starting from the reference structure. Diamonds and crosses represent non-equilibrium structures generated by Monte Carlo sampling following random- and phonon-based schemes, respectively.
  • Figure 4: Performance of GNN models trained on different datasets for the prediction of band gaps. Training was performed on datasets generated according to (a,b,c) phonon-based and (d,e,f) random displacement-based schemes. Model performance was tested on (a,d) the corresponding dataset, (b,e) the complementary dataset, and (c,f) a combined dataset comprising both atomic-displacement generation schemes. Solid lines indicate average values and shaded areas indicate statistical errors.
  • Figure 5: GNN model explainability for band gap prediction. (a) Reference unit cell for Ag$_3$SBr. (b,c) Mixed structure–graph representation of the system, where large circles represent unit-cell atoms and small circles their periodic images. Red dashed lines mark edges with importance greater than $0.86$, for (b) the best-performing GNN and (c) a poorly performing GNN. The unit cell is highlighted in black. (d) Graph representation showing all edges as gray dashed lines. (e,f) Edge-importance density distributions for the best-performing and poorly performing GNNs, compared with the total and pairwise edge densities. (g) Electronic density of states near the band gap computed with DFT methods. (h) Electronic structure of Ag$_3$SBr around the band gap, highlighting valence and conduction bands and the orbital hybridization of Ag and S $s$ electrons. (i) Band-gap reduction resulting from atomic position perturbations in Ag$_3$SBr.
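
The caption of Figure 1(c) notes that crystal periodicity shows up in the graph as self-loops and multiple edges. A minimal Python sketch of why that happens is given below. It assumes a simple distance-cutoff rule for drawing edges and searches only the 27 nearest periodic images, which is adequate when the cutoff is not larger than the cell. The function name `periodic_edges`, the cutoff values, and the two toy cubic cells are hypothetical; the paper's actual graph construction is not described in this summary.

```python
import itertools
import math


def periodic_edges(lattice, frac_coords, cutoff):
    """Return directed edges (i, j, shift) linking atom i in the home cell
    to the periodic image of atom j translated by the integer vector `shift`,
    for every such pair closer than `cutoff`.

    lattice: three Cartesian lattice vectors (rows); frac_coords: fractional
    positions. Only shifts in {-1, 0, 1}^3 are searched (an assumption)."""
    edges = []
    for i, fi in enumerate(frac_coords):
        for j, fj in enumerate(frac_coords):
            for shift in itertools.product((-1, 0, 1), repeat=3):
                if i == j and shift == (0, 0, 0):
                    continue  # an atom is not its own neighbour
                dfrac = [fj[k] + shift[k] - fi[k] for k in range(3)]
                dcart = [sum(dfrac[k] * lattice[k][c] for k in range(3))
                         for c in range(3)]
                if math.sqrt(sum(x * x for x in dcart)) <= cutoff:
                    edges.append((i, j, shift))
    return edges


cubic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# One atom per cell: every neighbour is an image of the same atom,
# so all edges are self-loops (i == j).
one_atom = periodic_edges(cubic, [[0.0, 0.0, 0.0]], cutoff=1.1)

# Two atoms per cell: the same pair (0, 1) is connected through several
# different images, which produces multiple edges between the two nodes.
two_atoms = periodic_edges(cubic, [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]], cutoff=0.9)

print(len(one_atom), len(two_atoms))
```

Keeping the `shift` vector on each edge is one way to distinguish otherwise identical node pairs, so that edge-level quantities such as the importances shown in Figure 5 can be attributed to a specific bond.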