
Predicting band gap from chemical composition: A simple learned model for a material property with atypical statistics

Andrew Ma, Owen Dugan, Marin Soljačić

TL;DR

The paper addresses predicting electronic band gaps from chemical composition when the target variable follows a mixed distribution, with a discrete probability mass at $0$ eV and a continuous portion over positive values. It introduces a simple, interpretable model with one learned parameter per element, using a ReLU transform: $\hat{\varepsilon}_{\mathrm{relu}}(M) = \mathrm{ReLU}(\mathbf{w} \cdot \mathbf{f}(M))$, and compares it to a linear baseline under gradient-based training. Across 10-fold cross-validation, the ReLU model achieves an MAE of $0.575 \pm 0.036$ eV, substantially lower than the linear model’s $0.824 \pm 0.035$ eV; its predictions are nonnegative and reproduce the probability mass at $0$ eV, and the learned elemental weights provide interpretable chemical insights. This approach offers a fast, interpretable complement to ab initio methods and graph-based models, with potential extensions to incorporate richer structural information or two-stage strategies while preserving a simple, composition-only interpretation.

Abstract

In solid-state materials science, substantial efforts have been devoted to the calculation and modeling of the electronic band gap. While a wide range of ab initio methods and machine learning algorithms have been created that can predict this quantity, the development of new computational approaches for studying the band gap remains an active area of research. Here we introduce a simple machine learning model for predicting the band gap using only the chemical composition of the crystalline material. To motivate the form of the model, we first analyze the empirical distribution of the band gap, which sheds new light on its atypical statistics. Specifically, our analysis enables us to frame band gap prediction as a task of modeling a mixed random variable, and we design our model accordingly. Our model formulation incorporates thematic ideas from chemical heuristic models for other material properties in a manner that is suited to the band gap modeling task. The model has exactly one parameter corresponding to each element, which is fit using data. To predict the band gap for a given material, the model computes a weighted average of the parameters associated with its constituent elements and then takes the maximum of this quantity and zero. The model provides heuristic chemical interpretability by intuitively capturing the associations between the band gap and individual chemical elements.
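The prediction rule described in the abstract (a weighted average of per-element parameters, followed by a maximum with zero) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `predict_band_gap`, the use of atomic fractions as the averaging weights, and every numeric parameter value below are assumptions chosen only for the example.

```python
from typing import Dict


def predict_band_gap(composition: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return max(weighted average of per-element parameters, 0), in eV.

    `composition` maps element symbols to amounts (e.g. {"Fe": 2, "O": 3}).
    Assumption: the averaging weights are atomic fractions.
    """
    total = sum(composition.values())
    if total <= 0:
        raise ValueError("composition must contain positive amounts")
    # Normalize amounts to fractions so the sum below is a weighted average.
    weighted_avg = sum(weights[el] * amt / total for el, amt in composition.items())
    # ReLU step: negative averages are mapped to a 0 eV (metallic) prediction.
    return max(weighted_avg, 0.0)


# Hypothetical parameter values in eV, NOT the learned values from the paper.
example_weights = {"Na": 1.0, "Cl": 9.0, "Fe": -3.0, "O": 4.0}

nacl_gap = predict_band_gap({"Na": 1, "Cl": 1}, example_weights)   # (1.0 + 9.0) / 2
fe_gap = predict_band_gap({"Fe": 1}, example_weights)              # negative average, clipped
fe2o3_gap = predict_band_gap({"Fe": 2, "O": 3}, example_weights)   # (2*(-3.0) + 3*4.0) / 5
```

With these illustrative values, a negative parameter pulls the average down and a sufficiently negative average yields exactly 0 eV, which is how a single formula can cover both zero-gap and non-zero-gap materials.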

Paper Structure

This paper contains 5 sections, 9 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: The atypical distribution of the band gap. The empirical cumulative distribution function (eCDF) for electronic band gap is shown for two datasets: a processed version of the Zhuo et al. dataset (Zhuo et al., 2018; Dunn et al., 2020) and a dataset based on the Materials Project (MP) (Dunn et al., 2020; Jain et al., 2013). From its eCDF, we can see that the band gap has a highly non-standard distribution: if viewed as a random variable, it is neither a purely discrete random variable nor a purely continuous random variable. For reference, we also show the cumulative distribution function for a normal distribution with mean and variance respectively equal to the sample mean and sample variance of the processed version of the Zhuo et al. dataset.
  • Figure 2: Modeling approach based on one parameter for each chemical element. The only information that is input into the model is the chemical composition (left panel). For a given material, the model makes a heuristic prediction of the band gap based on a weighted average of the parameters of the material's constituent elements followed by the ReLU function (center panel). The model is capable of predicting band gap for both non-zero and zero band gap materials (right panel).
  • Figure 3: Empirical distribution of model predictions. The test empirical cumulative distribution function (eCDF) is shown for the linear model's predictions (top) and the ReLU model's predictions (bottom). The red curves indicate the mean from cross validation and the green shaded region indicates the standard deviation from cross validation. For comparison, the eCDF of the labels (evaluated using the entire dataset) is also shown as a dashed blue curve in both plots. These plots show that the ReLU model captures the existence of a discrete probability mass at 0 eV (corresponding to metals), whereas the baseline linear model does not.
  • Figure 4: Periodic table visualization of the learned parameters. For each element $E$, its corresponding learned parameter $w_E$ in the ReLU model is indicated numerically and color-coded based on the scale bar (in units of eV). Elements that are not present in the dataset are displayed in gray. This visualization illustrates the heuristic chemical interpretability enabled by our simple model for band gap.
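The mixed-distribution behavior shown in the eCDF figures above can be illustrated with a small sketch: for a variable with a discrete mass at 0 eV, the eCDF jumps from 0 to a finite value at exactly 0 and then rises gradually. The helper `ecdf` and the toy band gap values are hypothetical, made up for illustration, and are not drawn from either dataset.

```python
from typing import Sequence


def ecdf(values: Sequence[float], x: float) -> float:
    """Empirical CDF: fraction of samples less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)


# Toy band gaps in eV (made-up data): half the samples are metals at exactly 0.
toy_gaps = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.1, 2.3, 3.4, 5.0]

below_zero = ecdf(toy_gaps, -1e-9)   # no samples lie below 0 eV
mass_at_zero = ecdf(toy_gaps, 0.0)   # jump at 0 equals the fraction of zero-gap samples
at_two_ev = ecdf(toy_gaps, 2.0)      # continuous part accumulates gradually above 0
```

A purely continuous variable would show no jump at 0, and a purely discrete one would rise only in steps; the combination of a single jump followed by a gradual rise is the signature of a mixed random variable.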