Known Unknowns: Out-of-Distribution Property Prediction in Materials and Molecules

Nofit Segal, Aviv Netanyahu, Kevin P. Greenman, Pulkit Agrawal, Rafael Gomez-Bombarelli

TL;DR

The paper tackles extrapolating material and molecular properties to out-of-distribution values. It introduces Bilinear Transduction, an anchor-based, transductive method that leverages analogical input–target changes to enable zero-shot OOD extrapolation. Across solid-state and molecular benchmarks, the approach yields substantial gains in OOD true positive rate and precision, and often improves OOD prediction accuracy compared with non-transductive baselines. This method enhances screening efficiency for high-potential candidates and provides interpretable analogies that reflect chemical changes, with broad applicability to other materials and molecular tasks.
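The anchor-based idea can be illustrated with a small sketch. It assumes the bilinear form commonly used for this kind of transduction: a prediction is the inner product of an embedding of the query-minus-anchor difference and an embedding of the anchor. The linear embeddings, the weight matrices `W_delta` and `W_anchor`, and the anchor-selection rule in `pick_anchor` are hypothetical placeholders chosen for illustration, not the authors' implementation; in practice the embeddings would be learned.

```python
import numpy as np

def bilinear_predict(x_query, x_anchor, W_delta, W_anchor):
    """Predict a scalar as <f(x_query - x_anchor), g(x_anchor)>.

    f and g are plain linear maps here (an assumption made for brevity).
    """
    delta_emb = W_delta @ (x_query - x_anchor)   # embedding of the input change
    anchor_emb = W_anchor @ x_anchor             # embedding of the anchor itself
    return float(delta_emb @ anchor_emb)

def pick_anchor(x_query, X_train):
    """Hypothetical anchor rule: pick the training point whose difference to
    the query is closest to a difference already seen between two distinct
    training points, so the query-anchor pair is analogous to a training pair.
    """
    n, d = X_train.shape
    # All pairwise training differences X[i] - X[j], excluding i == j.
    pair_diffs = X_train[:, None, :] - X_train[None, :, :]
    pair_diffs = pair_diffs[~np.eye(n, dtype=bool)]          # shape (n*(n-1), d)
    # Candidate differences between the query and each training point.
    query_diffs = x_query[None, :] - X_train                  # shape (n, d)
    # Distance from each candidate difference to its nearest training difference.
    dists = np.linalg.norm(
        query_diffs[:, None, :] - pair_diffs[None, :, :], axis=-1
    )
    return int(dists.min(axis=1).argmin())

# Toy usage: three training points on a line, query beyond the training range.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
x_query = np.array([3.0, 0.0])
anchor_idx = pick_anchor(x_query, X_train)
y_hat = bilinear_predict(x_query, X_train[anchor_idx], np.eye(2), np.eye(2))
```

The point of the sketch is that the query lies outside the training inputs, while the query-anchor difference is one the model has already seen between training points.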

Abstract

Discovery of high-performance materials and molecules requires identifying extremes with property values that fall outside the known distribution. Therefore, the ability to extrapolate to out-of-distribution (OOD) property values is critical for both solid-state materials and molecular design. Our objective is to train predictor models that extrapolate zero-shot to higher ranges than in the training data, given the chemical compositions of solids or molecular graphs and their property values. We propose using a transductive approach to OOD property prediction, achieving improvements in prediction accuracy. In particular, the True Positive Rate (TPR) of OOD classification of materials and molecules improved by 3x and 2.5x, respectively, and precision improved by 2x and 1.5x compared to non-transductive baselines. Our method leverages analogical input-target relations in the training and test sets, enabling generalization beyond the training target support, and can be applied to any other material and molecular tasks.
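As a rough illustration of the OOD classification metrics mentioned above, the sketch below treats a sample as OOD when its property value exceeds a threshold, then computes TPR and precision from predicted versus true values. The function name `ood_tpr_precision` and the choice of a single upper threshold (for example, the maximum training target) are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def ood_tpr_precision(y_true, y_pred, threshold):
    """Return (TPR, precision) for flagging samples whose value exceeds threshold.

    A sample is a true OOD case if y_true > threshold and is predicted OOD
    if y_pred > threshold. The threshold is an assumed stand-in for the
    upper edge of the training target range.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    true_ood = y_true > threshold
    pred_ood = y_pred > threshold
    tp = int(np.sum(true_ood & pred_ood))     # OOD samples correctly flagged
    fn = int(np.sum(true_ood & ~pred_ood))    # OOD samples missed
    fp = int(np.sum(~true_ood & pred_ood))    # in-distribution samples wrongly flagged
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return tpr, precision

# Toy usage with made-up numbers (not results from the paper).
tpr, precision = ood_tpr_precision(
    y_true=[1.0, 2.0, 5.0, 6.0],
    y_pred=[1.0, 4.5, 5.0, 3.0],
    threshold=4.0,
)
```

A regressor that never predicts above the training range scores a TPR of zero under this definition, which is why extrapolation matters for screening.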

Paper Structure

This paper contains 20 sections, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: The Bilinear Transduction prediction distribution on out-of-distribution values is closer to the ground truth distribution than those of other machine learning methods. (Left) Probability density functions for the Debye temperature of out-of-distribution materials. Compared with CrabNet, MODNet, and Ridge Regression, Bilinear Transduction (purple) has the most overlap with the ground truth (red). (Right) Probability density functions for out-of-distribution molecular hydration free energy, estimated with kernel density estimation, for the ground truth (red) and the predictions of Chemprop, MLP, Random Forest, and Bilinear Transduction (purple), which has the most overlap with the ground truth.
  • Figure 2: In-distribution and out-of-distribution Bulk Modulus (top) and Debye Temperature (bottom) predictions vs. ground truth values. Ridge Regression [kauwe2020can], MODNet [de2021materials], CrabNet [wang2021compositionally], and Bilinear Transduction (ours) all perform well within the training distribution (gray dots bounded by the red horizontal line), but Bilinear Transduction extends its predictions beyond this range on OOD data (red dots) and comes closest to the ground truth, achieving a lower OOD MAE and higher TPR.
  • Figure 3: In-distribution and out-of-distribution FreeSolv predictions vs. ground truth values. Chemprop [heid2023chemprop], RF, MLP, and Bilinear Transduction (ours) all perform well within the training distribution (gray dots bounded by the red horizontal line), but only Bilinear Transduction performs well beyond this range on OOD data (red dots).
  • Figure 4: Analogy Visualization Solids. AFLOW bulk modulus OOD predictions are based on in-distribution anchors that, paired with OOD points, form analogies to training pairs. (a) PCA plot of all samples in the dataset. The OOD-anchor difference is similar to the training-anchor difference. (b) Ground truth bulk modulus training and test distributions, with the values of the OOD point, its anchor, and the analogous training point and anchor. (c) Compositional visualization of the analogy. The OOD and training points differ by one neighboring f-block element, as do the OOD and training anchors.
  • Figure 5: Analogy Visualization Molecules. MoleculeNet ESOL OOD predictions are based on differences between in-distribution anchors and OOD points that form analogies to training pairs. (a) OOD-anchor and training target-anchor differences. (b) Ground truth training and test distributions, with the values of the OOD point, its anchor, and the analogous training target and anchor. (c) Analogous molecule pairs. OOD-anchor and training target-anchor similarities, measured with MCS, are highlighted in red.
  • ...and 4 more figures