Learning to predict superconductivity

Omri Lesser, Yanjun Liu, Natalie Maus, Aaditya Panigrahi, Krishnanand Mallayya, Leslie M. Schoop, Jacob R. Gardner, Eun-Ah Kim

TL;DR

The paper tackles the challenge of predicting superconductivity by leveraging a data-driven featurization that fuses structural information and elemental properties through graphlet histograms and symmetry indicators derived from the 3DSC CIF database. It introduces a novel Earth Mover's Distance kernel for Gaussian-process learning on histogram-based features, with an explicit proof of kernel validity, and demonstrates high accuracy ($R^2 \approx 0.93$) for $T_c$ prediction along with uncertainty estimates. A striking finding is that a four-feature subset—dominated by the electron affinity difference between neighboring atoms—nearly saturates predictive performance, revealing a universal, chemistry-driven descriptor for $T_c$. The work also delivers a superconductivity classifier with quantified uncertainties and shows the framework's potential to rapidly screen inorganic crystals, extendable to other material properties beyond superconductivity.

Abstract

Predicting the superconducting transition temperature ($T_c$) of materials remains a major challenge in condensed matter physics due to the lack of a comprehensive and quantitative theory. We present a data-driven approach that combines chemistry-informed feature extraction with interpretable machine learning to predict $T_c$ and classify superconducting materials. We develop a systematic featurization scheme that integrates structural and elemental information through graphlet histograms and symmetry vectors. Using experimentally validated structural data from the 3DSC database, we construct a curated, featurized dataset and design a new kernel to incorporate histogram features into Gaussian-process (GP) regression and classification. This framework yields an interpretable $T_c$ predictor with an $R^2$ value of 0.93 and a superconductor classifier with quantified uncertainties. Feature-significance analysis further reveals that the GP $T_c$ predictor can achieve near-optimal performance using only four second-order graphlet features. In particular, we identify a previously overlooked feature, the electron affinity difference between neighboring atoms, as a universally predictive descriptor. Our graphlet-histogram approach not only highlights bonding-related elemental descriptors as unexpectedly powerful predictors of superconductivity but also provides a broadly applicable framework for predictive modeling of diverse material properties.
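
To make the idea of a kernel on histogram features concrete, the following is a minimal sketch and not the paper's implementation. It assumes one-dimensional histograms that share the same equally spaced bins, for which the Earth Mover's Distance (EMD) reduces to the summed absolute difference between the two cumulative distributions. The exponential kernel form, the function names `emd_1d` and `emd_kernel`, and the `length_scale` parameter are illustrative assumptions; the exact kernel used in the paper is not specified in this summary.

```python
from itertools import accumulate
from math import exp


def emd_1d(p, q, bin_width=1.0):
    """EMD between two 1D histograms defined on the same equally spaced bins.

    Both histograms are normalized to unit mass; in 1D the EMD is then the
    L1 distance between their cumulative distributions times the bin width.
    """
    p_total, q_total = sum(p), sum(q)
    cdf_p = accumulate(x / p_total for x in p)
    cdf_q = accumulate(x / q_total for x in q)
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def emd_kernel(hists, length_scale=1.0):
    """Gram matrix K[i][j] = exp(-EMD(h_i, h_j) / length_scale).

    ASSUMPTION: this exponential form is chosen for illustration only.
    """
    return [[exp(-emd_1d(a, b) / length_scale) for b in hists] for a in hists]


# Hypothetical toy histograms (not data from the paper).
histograms = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
for row in emd_kernel(histograms):
    print([round(k, 3) for k in row])
```

In this sketch a smaller `length_scale` makes the kernel more sensitive to differences along that histogram feature, which is the sense in which an inverse length scale can be read as a feature importance.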

Paper Structure

This paper contains 13 sections, 30 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Database and featurization. (a) Distribution of superconductor classes in the 3DSC database. (b) Examples of entries from 3DSC, listing the chemical formula, $T_c$, and superconducting class. Structural information is provided by CIFs (not shown). (c) The ten atomic properties we use to characterize each element (see SM Table S1). (d) Example of a crystal structure and its first-, second-, and third-order graphlets. We use Li$_{0.8}$FeAs as the example (only a subset of the second- and third-order graphlets are shown for brevity). For each order, histogram features are generated from elemental properties (electron affinity, atomic weight, etc.) and structural information (inter-atomic distance and bond angle). For the second- and third-order graphlets, distinct sets of permutation-invariant histogram features are provided. In second-order graphlets, we calculate means and differences, whereas in third-order graphlets, we calculate the mean, standard deviation, and kurtosis. (e) Second-order histogram features of inter-atomic distances for four cuprate superconductors from different families. Blue bars highlight differences in interlayer distances. (f) The eleven point group operations assigned to each site by its crystallographic point group. The crystal symmetry vector is obtained by averaging site symmetry vectors over all inequivalent occupied sites in the unit cell.
  • Figure 2: Machine learning workflow. (a) Data curation focusing on representative CIFs. For materials with multiple CIFs, we focused on cases in which all graphlet histograms lie within EMD$<0.2$ of each other, and for these we choose one CIF. (b) Flowchart describing the machine learning strategy in this work. We feed graphlet histograms and symmetry vectors [see Fig. 1] into neural networks and GP models for the $T_c$ prediction task, and into a GP classifier for the SC classification task. (c) For the neural networks, the graphlet histograms go through a convolutional layer before being passed to a fully connected feed-forward NN. (d) The GP models use EMD to quantify similarity between a pair of graphlet histograms. Small EMD indicates similar histograms (top) while large EMD indicates dissimilar histograms (bottom).
  • Figure 3: $T_c$ prediction results. (a) Experimental $T_c$ vs. neural network predictions of $T_c$. (b) Experimental $T_c$ vs. GP predictions of $T_c$. GP predictions include uncertainty estimates, with error bars denoting one standard deviation. (c) Relative prediction error vs. relative prediction uncertainty in the GP model, showing that large errors are usually accompanied by large uncertainties. (d) GP model performance using different subsets of features. Using higher-order histogram features and adding symmetry features each improve the model's performance, in both $R^2$ score and mean absolute error (MAE). (e) Performance of the GP model when only subsets of the features are included. The model maintains almost its full predictive power with as few as four features.
  • Figure 4: Interpretation of the GP $T_c$ prediction model. (a) Inverse length scales $\ell^{-1}$ (feature importance) of one of three four-feature sets with $R^2>0.92$. (b--c) Specific examples of electron affinity (EA) difference histograms (the most predictive atomic feature) are shown in (b) for superconducting cuprates from La-based, Bi-based, and Hg-based families and in (c) for iron-based superconductors from the 111, 122, and 1111 families. In both cases the bars shift towards larger EA difference with increasing $T_c$. (d) Inverse length scales of the symmetry features. (e) $T_c$ vs. four exemplary average symmetry features. In the left plots, $\sigma_{d}$ and $C_{4}$ [shown as dashed blue lines in (d)], which are learned as highly predictive by the GP, show distinct shapes that partly differentiate between values of $T_c$. In the right plots, $i$ and $C_6$ [shown as dashed gray lines in (d)], which are learned as not predictive by the GP, exhibit either a large spread in $T_c$ ($i$) or almost no change in $T_c$ ($C_6$).
  • Figure 5: SC / non-SC classification. (a) Composition of the dataset for classification. (b) Confusion matrix of the GP classifier along with the uncertainty distributions of the four cases. The GP classifier is slightly biased towards predicting SC. Misclassified materials are associated with higher corresponding uncertainties. (c) Performance of the GP classifier when only subsets of features are included. The model's performance begins to degrade when the number of features falls below 10.
  • ...and 3 more figures
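
The second-order graphlet features described in the Figure 1 caption (means and differences of an elemental property over pairs of neighboring atoms, collected into a histogram) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the neighbor list `bonds` is taken as given rather than derived from a CIF, the absolute difference is used so that the feature does not depend on the order of the two atoms, and the element labels, property values, and bin range are hypothetical placeholders.

```python
def pair_features(species, bonds, prop):
    """Mean and absolute difference of a property over neighboring atom pairs.

    species: element label of each site
    bonds:   (i, j) index pairs of neighboring sites (assumed precomputed)
    prop:    element label -> property value (e.g. electron affinity)
    """
    means, diffs = [], []
    for i, j in bonds:
        a, b = prop[species[i]], prop[species[j]]
        means.append(0.5 * (a + b))
        diffs.append(abs(a - b))  # ASSUMPTION: absolute value for permutation invariance
    return means, diffs


def histogram(values, lo, hi, n_bins):
    """Normalized histogram with equal-width bins; out-of-range values are clamped."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        k = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[k] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Hypothetical elements and property values, NOT reference data.
species = ["X", "Y", "Y", "Z"]
bonds = [(0, 1), (1, 2), (2, 3)]
placeholder_property = {"X": 0.0, "Y": 1.0, "Z": 3.0}

means, diffs = pair_features(species, bonds, placeholder_property)
print(histogram(diffs, lo=0.0, hi=4.0, n_bins=4))
```

Histograms built this way for different materials share the same bins, so they can be compared pairwise with a distance such as the EMD.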