A critical assessment of bonding descriptors for predicting materials properties

Aakash Ashok Naik, Nidal Dhamrait, Katharina Ueltzen, Christina Ertural, Philipp Benner, Gian-Marco Rignanese, Janine George

TL;DR

This work critically evaluates quantum-chemical bonding descriptors as features for predicting solid-state material properties, expanding a high-throughput bonding database to roughly 13,000 materials and deriving new descriptors from LOBSTER outputs. By applying all-relevant feature selection, distance-correlation analyses, SHAP, and corrected resampling t-tests across Random Forest and MODNet models, the study shows that bonding descriptors provide complementary, target-dependent predictive value, especially for bond-stiffness, lattice thermal conductivity, and elasticity-related properties, while offering limited gains for global thermodynamic quantities. Symbolic regression with SISSO yields interpretable relationships linking bonding descriptors to key targets (e.g., max_pfc correlating with the strongest bond length ratio; log_klat_300 tied to bonding heterogeneity and volume per atom), highlighting physically intuitive connections. The findings suggest pathways to faster surrogate models and bonding-informed representations (including GNNs) for efficient materials discovery, and indicate that the largest gains are realized for directional or local properties rather than averaged thermodynamic quantities.

Abstract

Most machine learning models for materials science rely on descriptors based on materials compositions and structures, even though the chemical bond has been proven to be a valuable concept for predicting materials properties. Over the years, various theoretical frameworks have been developed to characterize bonding in solid-state materials. However, integrating bonding information from these frameworks into machine learning pipelines at scale has been limited by the lack of a systematically generated and validated database. Recent advances in high-throughput bonding analysis workflows have addressed this issue, and our previously computed Quantum-Chemical Bonding Database for Solid-State Materials was extended to include approximately 13,000 materials. This database is then used to derive a new set of quantum-chemical bonding descriptors. A systematic assessment is performed using statistical significance tests to evaluate how the inclusion of these descriptors influences the performance of machine-learning models that otherwise rely solely on structure- and composition-derived features. Models are built to predict elastic, vibrational, and thermodynamic properties typically associated with chemical bonding in materials. The results demonstrate that incorporating quantum-chemical bonding descriptors not only improves predictive performance but also helps identify intuitive expressions for properties such as the projected force constant and lattice thermal conductivity via symbolic regression.

Paper Structure (18 sections, 11 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Descriptor ranking based on ARFS scores for the maximum of bond-projected force constant (max_pfc), where the legend distinguishes descriptor groups derived from "MATMINER" (structure- and composition-based) and "LOBSTER" (extracted from LOBSTER calculation data).
  • Figure 2: Distance correlation between descriptor sets and targets for (a) Maximum of bond-projected force constant, max_pfc and (b) Heat capacity at 25 K, Cv_25. Descriptor sets are "MATMINER" (structure and composition), "LOBSTER" (extracted from LOBSTER calculation data), and their combination "MATMINER+LOBSTER". Only the descriptors relevant to each target are included from each set in this analysis.
  • Figure 3: Descriptor ranking based on SHAP scores for the maximum of the bond-projected force constant (max_pfc) from the MODNet model, where the legend distinguishes descriptor groups derived from "MATMINER" (structure- and composition-based) and "LOBSTER" (extracted from LOBSTER calculation data).
  • Figure 4: Corrected resampling t-test based on per-fold mean absolute errors from 10-fold cross-validation for model comparison using the "MATMINER" (structure and composition) and "MATMINER+LOBSTER" (structure and composition combined with descriptors extracted from LOBSTER calculation data) descriptor sets. (a) Total lattice thermal conductivity at 300 K, log_klat_300, from the RF model shows a 2.85% improvement with the larger feature set, and (b) maximum of bond-projected force constant, max_pfc, from the MODNet model shows a 19.5% improvement.
  • Figure 5: Parity plots showing the correlation between (a) the last peak of the phonon density of states, "last_ph_peak", and the maximum of the bond-projected force constant, "max_pfc", and (b) the absolute value of the "ICOHP" of the strongest bond (i.e., the minimum "ICOHP" in a material) and "max_pfc".
  • ...and 3 more figures
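
The model comparison described in the Figure 4 caption can be sketched as follows. This is a minimal illustration that assumes the commonly used Nadeau-Bengio form of the corrected resampled t-test, in which the variance of the per-fold error differences is inflated by the test-to-train size ratio; the paper's own implementation is not shown here and may differ. The function name `corrected_resampled_t` and the per-fold MAE values are hypothetical and serve only to demonstrate the calculation.

```python
import math
import statistics


def corrected_resampled_t(errors_a, errors_b, test_train_ratio):
    """Corrected resampled t-statistic for two models scored on the same folds.

    errors_a, errors_b: per-fold errors (e.g. MAE) of model A and model B.
    test_train_ratio:   n_test / n_train per fold (1/9 for 10-fold CV).
    Returns (t_statistic, degrees_of_freedom).
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both models need one error per shared fold")
    k = len(errors_a)
    if k < 2:
        raise ValueError("at least two folds are required")

    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.fmean(diffs)
    var_diff = statistics.variance(diffs)  # sample variance (ddof = 1)
    if var_diff == 0:
        raise ValueError("zero variance in fold differences")

    # A plain paired t-test would use var_diff / k. The correction adds
    # n_test / n_train because training sets overlap between folds, so
    # the per-fold differences are not independent.
    corrected_var = (1.0 / k + test_train_ratio) * var_diff
    return mean_diff / math.sqrt(corrected_var), k - 1


# Hypothetical per-fold MAEs, NOT values from the paper.
mae_baseline = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53, 0.51, 0.49, 0.54, 0.50]
mae_extended = [0.43, 0.40, 0.45, 0.41, 0.39, 0.44, 0.42, 0.40, 0.44, 0.41]

t_stat, dof = corrected_resampled_t(mae_baseline, mae_extended, 1 / 9)

# Relative improvement in mean MAE, the kind of percentage quoted in Figure 4.
improvement = 1 - statistics.fmean(mae_extended) / statistics.fmean(mae_baseline)

# Two-sided 5% critical value of Student's t for 9 degrees of freedom.
T_CRIT_DF9 = 2.262
significant = abs(t_stat) > T_CRIT_DF9
```

The statistic is compared against a Student's t distribution with k - 1 degrees of freedom; the sketch uses the tabulated 5% critical value for 10 folds, but a p-value could equally be obtained from a statistics library. A positive statistic here means the first model has the larger error.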