Estimation of Electronic Band Gap Energy From Material Properties Using Machine Learning
Sagar Prakash Barad, Sajag Kumar, Subhankar Mishra
TL;DR
The paper tackles predicting electronic band gap energy and gap type from fundamental material properties, without relying on preliminary DFT calculations or knowledge of the material's structure. It introduces a clustered gap predictor (CGP) that partitions non-metals into five clusters and trains cluster-specific regressors and gap-type classifiers, alongside a shared metal/non-metal classifier. Using benchmark AFLOW data with 55,298 samples and 9 features (including engineered electronegativity and group_numbers), the paper defines a joint evaluation score, $\text{Score}$, that assesses regression, gap-type classification, and metal/non-metal decisions together. CGP reports an AUC-ROC of 0.99 for metal/non-metal classification, an average cluster MAE of $0.2321$ eV, and a final score of $0.9336$, indicating improved predictive capability over a single-model approach. The study suggests future work on more advanced clustering, larger datasets, and extensions to other material properties.
Abstract
Machine learning techniques are used to estimate the electronic band gap energy and predict the band gap category of materials from experimentally measurable properties. The band gap energy is critical for determining various properties of a material, such as whether it is metallic, and its potential applications in electronic and optoelectronic devices. While numerical methods exist for computing band gap energy, they often entail high computational costs and have limitations in accuracy and scalability. A machine learning model that quickly predicts the band gap energy of a material from easily obtainable experimental properties would offer a superior alternative to conventional density functional theory (DFT) methods. Our model does not require any preliminary DFT-based calculation or knowledge of the structure of the material. We present a scheme for improving the performance of simple regression and classification models by partitioning the dataset into multiple clusters. Building on traditional evaluation metrics, we introduce a new evaluation scheme for comparing ML-based models in materials science that involve both regression and classification tasks. We show that, under this new evaluation metric, our method of clustering the dataset results in better performance.
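
The idea of partitioning the dataset and fitting one simple model per cluster can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, k-means is an assumed choice of clustering algorithm (the text above does not name one), two clusters are used instead of five, a one-feature least-squares line stands in for the regression models, and the helper names (`fit_line`, `kmeans_1d`, `nearest`, `mae`, `predict`) are hypothetical. The metal/non-metal and gap-type classifiers are omitted, and errors are measured on the fitting data for brevity.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def kmeans_1d(xs, k, iters=25):
    """Lloyd's k-means on scalars; centroids start at evenly spaced quantiles."""
    srt = sorted(xs)
    centroids = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, centroids)].append(x)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def mae(pred, true):
    """Mean absolute error."""
    return mean(abs(p - t) for p, t in zip(pred, true))

# Synthetic stand-in data (an assumption): one feature, two regimes
# that follow different linear trends.
xs = [i / 10 for i in range(11)] + [5 + i / 10 for i in range(11)]
ys = [2 * x if x < 3 else 10 - x for x in xs]

# Baseline: a single regressor for the whole dataset.
a, b = fit_line(xs, ys)
global_mae = mae([a * x + b for x in xs], ys)

# Clustered variant: partition the data, then fit one regressor per cluster.
centroids = kmeans_1d(xs, k=2)
labels = [nearest(x, centroids) for x in xs]
models = {}
for c in set(labels):
    cx = [x for x, l in zip(xs, labels) if l == c]
    cy = [y for y, l in zip(ys, labels) if l == c]
    models[c] = fit_line(cx, cy)

def predict(x):
    """Route x to its nearest cluster and apply that cluster's model."""
    a_c, b_c = models[nearest(x, centroids)]
    return a_c * x + b_c

clustered_mae = mae([predict(x) for x in xs], ys)
print(f"global MAE = {global_mae:.3f}, clustered MAE = {clustered_mae:.3f}")
```

On this toy data the two regimes are exactly linear, so the per-cluster models fit them almost perfectly while the single global line cannot. Real material data would not separate this cleanly, and a held-out test set would be needed for a fair comparison.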
