Predicting band gap from chemical composition: A simple learned model for a material property with atypical statistics
Andrew Ma, Owen Dugan, Marin Soljačić
TL;DR
The paper addresses predicting electronic band gaps from chemical composition when the target variable follows a mixed distribution: a discrete mass at $0$ eV plus a continuous portion over positive values. It introduces a simple, interpretable model with one learned parameter per element and a ReLU applied to the output, $\hat{\varepsilon}_{\mathrm{relu}}(M) = \mathrm{ReLU}(\mathbf{w} \cdot \mathbf{f}(M))$, and compares it with a linear baseline, using gradient-based training. In 10-fold cross-validation, the ReLU model achieves an MAE of $0.575 \pm 0.036$ eV, substantially better than the linear model’s $0.824 \pm 0.035$ eV; it captures the mass at $0$ eV while keeping predictions nonnegative, and the learned elemental weights provide interpretable chemical insights. The approach offers a fast, interpretable complement to ab initio methods and graph-based models, with potential extensions that incorporate richer structural information or two-stage strategies while preserving a simple, composition-only interpretation.
Abstract
In solid-state materials science, substantial efforts have been devoted to the calculation and modeling of the electronic band gap. While a wide range of ab initio methods and machine learning algorithms have been created that can predict this quantity, the development of new computational approaches for studying the band gap remains an active area of research. Here we introduce a simple machine learning model for predicting the band gap using only the chemical composition of the crystalline material. To motivate the form of the model, we first analyze the empirical distribution of the band gap, which sheds new light on its atypical statistics. Specifically, our analysis enables us to frame band gap prediction as a task of modeling a mixed random variable, and we design our model accordingly. Our model formulation incorporates thematic ideas from chemical heuristic models for other material properties in a manner that is suited to the band gap modeling task. The model has exactly one parameter corresponding to each element, which is fit using data. To predict the band gap for a given material, the model computes a weighted average of the parameters associated with its constituent elements and then takes the maximum of this quantity and zero. The model provides heuristic chemical interpretability by intuitively capturing the associations between the band gap and individual chemical elements.
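The prediction rule described above (a weighted average of per-element parameters, followed by a maximum with zero) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that the averaging weights are atomic fractions read from the formula, and the names `WEIGHTS`, `fractions`, `predict_linear`, and `predict_relu` are hypothetical. The numeric parameter values are made up for illustration and are not fitted results from the paper.

```python
import re

# Hypothetical per-element parameters in eV. These are illustrative
# placeholders, NOT the fitted values reported in the paper.
WEIGHTS = {"Na": 3.0, "Cl": 7.0, "Fe": -2.0, "Cu": -1.5}


def fractions(formula):
    """Parse a simple formula such as 'Fe2Cl3' into atomic fractions.

    Assumption: the weighted average uses atomic fractions. Only element
    symbols followed by optional integer counts are handled (no parentheses).
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0.0) + (float(number) if number else 1.0)
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


def predict_linear(formula, weights):
    """Linear baseline: weighted average of the per-element parameters."""
    return sum(weights[element] * x for element, x in fractions(formula).items())


def predict_relu(formula, weights):
    """ReLU model: the same weighted average, clipped at zero from below."""
    return max(0.0, predict_linear(formula, weights))


if __name__ == "__main__":
    for formula in ["NaCl", "FeCu", "FeCl3"]:
        print(formula, predict_linear(formula, WEIGHTS), predict_relu(formula, WEIGHTS))
```

With these placeholder values, a compound whose weighted average is negative (here the hypothetical `FeCu`) is mapped to exactly $0$ eV by the ReLU model, whereas the linear baseline returns a negative band gap. This is how the clipping lets a single formula produce both the discrete mass at zero and the continuous positive part of the distribution.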
