Classification-based detection and quantification of cross-domain data bias in materials discovery

Giovanni Trezza; Eliodoro Chiavazzo

TL;DR

This work addresses cross-domain data bias in AI-driven materials discovery by introducing a binary-classifier filter that distinguishes materials within the training domain from out-of-domain samples, using a general reference space (MaterialsProject). It formalizes in-distribution versus cross-domain scenarios and validates the approach on two case studies, superconductors (SuperCon) and thermoelectrics (ESTM), showing that classifier performance correlates with regression reliability and that more stringent filtering improves predictive accuracy. SHAP-based feature analysis identifies the key descriptors driving predictions, and two validity checks (clustering-based and threshold-based) support the method's robustness and its ability to circumvent bias through filtering. The framework yields a practical bias metric, the classifier AUC, and offers a scalable, architecture-agnostic tool that can complement generative models and help prevent unreliable predictions in large-scale materials discovery pipelines.
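To make the AUC-based bias metric concrete, the following minimal sketch computes the area under the ROC curve from classifier scores using the rank-sum (Mann-Whitney) identity. The function name `roc_auc` and the toy scores are hypothetical and not taken from the paper. The reading of the result is also an assumption drawn from the summary above: an AUC near 0.5 would suggest the specialized dataset is hard to tell apart from the reference space, while an AUC near 1 would suggest it is easily separable, that is, more strongly biased.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: probability of belonging to the specialized dataset
# (label 1) rather than to the reference space (label 0).
labels = [1, 1, 1, 0, 0, 0]
separable_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
overlapping_scores = [0.6, 0.4, 0.5, 0.5, 0.6, 0.4]

auc_separable = roc_auc(labels, separable_scores)      # classes fully separated
auc_overlapping = roc_auc(labels, overlapping_scores)  # classes indistinguishable
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; a sort-based rank computation would be preferable for large datasets.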

Abstract

It stands to reason that the amount and quality of data are of key importance for setting up accurate AI-driven models. Among other factors, a fundamental aspect to consider is the bias introduced during sample selection in database generation. This is particularly relevant when a model is trained on a specialized dataset to predict a property of interest and is then applied to forecast the same property over samples having a completely different genesis. Indeed, the resulting biased model will likely produce unreliable predictions for many of those out-of-the-box samples. Neglecting this aspect may hinder the AI-based discovery process, even when high-quality, sufficiently large and highly reputable data sources are available. In this regard, taking superconducting and thermoelectric materials as two prototypical case studies in the field of energy material discovery, we present and validate a new method, based on a classification strategy, capable of detecting, quantifying and circumventing cross-domain data bias.

Paper Structure (17 sections, 7 figures)

This paper contains 17 sections, 7 figures.

Introduction
Methods
Cross-domain data bias: assessing and circumventing
Validity checks
Supervised ML models
Results
Superconducting materials
SuperCon and featurization
Cross-domain bias detection and quantification
Features importances
Validity checks
Thermoelectric materials
ESTM and featurization
Cross-domain bias detection and quantification
Features importances
...and 2 more sections

Figures (7)

Figure 1: Overview of the main bias types and of the proposed methodology to circumvent them in materials discovery. $\mathbf{a}$ Sketches of in-distribution and cross-domain data biases (adapted from Bahng et al., 2020). In the former, training samples and out-of-the-box samples share the same distributions in terms of the signal $S$, the bias $B$ and the target colour $Y$ (e.g., training samples and out-of-the-box samples come from the same source of data); in the latter, the bias distributions differ (e.g., training samples and out-of-the-box samples come from different sources of data). $\mathbf{b}$ Proposed methodology to detect and circumvent bias in materials discovery: a regression model is trained to predict a target property $f(\mathbf{x})$; when predicting $f(\mathbf{x})$ for an out-of-the-box material, the prediction is reliable only if the material belongs to the same materials space as the training set, otherwise no conclusion can be drawn; membership is assessed by means of a binary classifier trained on the whole materials space.
Figure 2: Protocol for validating the relationship between regressor and classifier performances (first validity check). The specialized dataset comes with the most important Matminer composition-based features according to a SHAP ranking performed over the corresponding regression model. The dataset is partitioned into two clusters, A and B, with the agglomerative clustering algorithm. The entire dataset is then randomly split 80/20 for training/testing of (i) a classifier that discriminates the two clusters and (ii) a regressor that predicts the target property of interest $y$ over cluster A only. The trained classifier is used to discriminate clusters A and B over the testing set, while the regressor is used to predict $y$ over the testing samples labeled as cluster A, with noise being progressively injected into the cluster A/B labels. In this way it is possible to assess the relationship between classification and regression performances.
Figure 3: Overview of the protocol for validating the relationship between the stringency of the classifier filter and the regression performances, along with the choice of MaterialsProject as the less biased database (second validity check). The specialized dataset comes with the most important Matminer composition-based features according to a SHAP ranking performed over the corresponding regression model. The dataset is partitioned into two clusters, A and B, with the agglomerative clustering algorithm, as reported in Supplementary Note 2. Half of cluster A is used to train/test a regressor for the prediction of the property of interest $y$. The same half is used as class 1 across a set of 10 classifiers, with class 0 being represented by 10 different random subsets of the MaterialsProject database, each with the same cardinality as class 1. The regression model is then used to predict $y$ for those materials in the second half of cluster A and in cluster B that pass the classifier filter, i.e., that show an average probability of being classified as class 1 greater than a set threshold.
Figure 4: Results of the ETC-based pipelines and dataset compositions. $\mathbf{a}$ ROC curves for the classification models over datasets A$^{\prime}$, B$^{\prime}$, C$^{\prime}$, D$^{\prime}$, as in the main text, together with the No Skill classifier. $\mathbf{b}$ Normalized cumulative curve for the coefficients of importance of the ETC-based pipeline on dataset A$^{\prime}$. $\mathbf{c}$ to $\mathbf{f}$ Distributions of the features "MagpieData mean AtomicWeight" and "MagpieData mean Electronegativity" for materials inside SuperCon (blue) and outside SuperCon (orange), over datasets A$^{\prime}$ ($\mathbf{c}$), B$^{\prime}$ ($\mathbf{d}$), C$^{\prime}$ ($\mathbf{e}$) and D$^{\prime}$ ($\mathbf{f}$).
Figure 5: Results of the validity checks for the superconductors case. $\mathbf{a}$ Predictions of the ETR-based pipeline over the SuperCon testing set, with the corresponding performances shown in terms of coefficient of determination $R^2$, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), and with the sizes of the training and testing sets, $N_{\textrm{train}}$ and $N_{\textrm{test}}$, respectively. $\mathbf{b}$ Corresponding normalized cumulative curve for the SHAP-based coefficients of importance. $\mathbf{c}$ Results of the first validity check (as detailed in the main text): performances ($R^2$, MAE, RMSE) of an ETR regressor with default hyperparameters, trained over cluster A only with the 29 most important features from subplot $\mathbf{b}$, plotted against the AUC of an ETC classifier with default hyperparameters trained to discriminate the two clusters of the SuperCon database, for 1000 distinct percentages of noise injected into the cluster labels. $\mathbf{d}$ Results of the second validity check (as detailed in the main text): performances ($R^2$, MAE, RMSE) of the same ETR regressor as in subplot $\mathbf{c}$, plotted against 1000 distinct stringency thresholds. Each threshold is the value that the average probability over 10 ETC-based classifiers (default hyperparameters, same 29 features as in subplot $\mathbf{b}$) must exceed for a testing-set material to be classified as belonging to the ETR training materials space.
...and 2 more figures
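
The filtering step described in the captions of Figures 1b and 3 can be sketched as follows: several classifiers share the same class 1 but use different random reference subsets as class 0, and the regressor is queried only for materials whose average class-1 probability exceeds a threshold. This is a minimal illustration, not the authors' implementation. All function names, the toy classifiers and the toy regressor are hypothetical stand-ins; in the paper the classifiers and regressor are ETC- and ETR-based pipelines.

```python
import random
from statistics import mean


def balanced_negative_subsets(reference_ids, n_positive, n_classifiers=10, seed=0):
    """Draw one random reference subset per classifier, each the size of class 1."""
    rng = random.Random(seed)
    return [rng.sample(reference_ids, n_positive) for _ in range(n_classifiers)]


def passes_filter(class1_probabilities, threshold):
    """True if the mean class-1 probability over the ensemble exceeds the threshold."""
    return mean(class1_probabilities) > threshold


def filtered_predictions(candidates, classifiers, regressor, threshold):
    """Return {name: prediction} only for candidates that pass the filter."""
    kept = {}
    for name, features in candidates.items():
        probabilities = [classifier(features) for classifier in classifiers]
        if passes_filter(probabilities, threshold):
            kept[name] = regressor(features)
    return kept


def make_toy_classifier(center):
    """Hypothetical stand-in: class-1 probability decays with distance from center."""
    def classifier(features):
        return min(1.0, max(0.0, 1.0 - abs(features[0] - center)))
    return classifier


def toy_regressor(features):
    """Hypothetical stand-in for the trained property regressor."""
    return 10.0 * features[0]


# Class 0 for each of 10 classifiers: a random subset of a made-up reference set.
subsets = balanced_negative_subsets(list(range(100)), n_positive=5)

classifiers = [make_toy_classifier(c) for c in (0.9, 1.0, 1.1)]
candidates = {"in_domain": [1.0], "out_of_domain": [3.0]}
kept = filtered_predictions(candidates, classifiers, toy_regressor, threshold=0.5)
```

Materials that fail the filter are simply omitted from the result, mirroring the statement that no conclusion can be drawn for them. Raising `threshold` makes the filter more stringent and leaves fewer candidates.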
