A General Approach for Determining Applicability Domain of Machine Learning Models

Lane E. Schultz; Yiqi Wang; Ryan Jacobs; Dane Morgan

TL;DR

The authors develop a new, general approach for assessing model domain and demonstrate that it provides accurate and meaningful domain designation across multiple model types and material property data sets.

Abstract

Knowledge of the domain of applicability of a machine learning model is essential to ensuring accurate and reliable model predictions. In this work, we develop a new and general approach for assessing model domain and demonstrate that our approach provides accurate and meaningful domain designation across multiple model types and material property data sets. Our approach assesses the distance between data in feature space using kernel density estimation, and this distance provides an effective tool for domain determination. We show that chemical groups considered unrelated based on chemical knowledge exhibit significant dissimilarities by our measure. We also show that high measures of dissimilarity are associated with poor model performance (i.e., high residual magnitudes) and unreliable estimates of model uncertainty. Automated tools are provided to enable researchers to establish acceptable dissimilarity thresholds and so identify whether new predictions of their own machine learning models are in-domain or out-of-domain.
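To make the idea of a kernel-density-based distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes an isotropic Gaussian kernel with a hand-picked bandwidth and one plausible normalization, $d = 1 - p(x)/\max_i p(x_i)$, where the maximum runs over the training points. The function names `gaussian_kde_density` and `dissimilarity` are hypothetical.

```python
import math


def gaussian_kde_density(x, train, bandwidth):
    """Average of isotropic Gaussian kernels centered on the training points."""
    dim = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for t in train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, t))
        total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
    return total / (len(train) * norm)


def dissimilarity(x, train, bandwidth=0.5):
    """Assumed normalization: d = 1 - p(x) / max_i p(x_i), clipped to [0, 1].

    d is near 0 where training data are dense and near 1 far from them.
    """
    p_max = max(gaussian_kde_density(t, train, bandwidth) for t in train)
    d = 1.0 - gaussian_kde_density(x, train, bandwidth) / p_max
    # A query point can be slightly denser than any training point, so clip.
    return min(1.0, max(0.0, d))


# Toy feature vectors (illustrative values only, not from the paper).
train = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1), (0.05, 0.2)]
d_near = dissimilarity((0.0, 0.05), train)  # inside the training cloud
d_far = dissimilarity((5.0, 5.0), train)    # far from all training data
print(d_near, d_far)
```

In practice the features would be scaled and the bandwidth chosen from the data. Both are left as fixed assumptions here to keep the sketch short.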

Paper Structure (37 sections, 8 equations, 10 figures, 2 tables)

Introduction
Results
Discussion
Methods
Data Availability
Code Availability
Acknowledgements
Author Contributions
Competing Interests

Figures (10)

Figure 1: Visualization of scores from Table \ref{table_results}. For all points shown in these figures, a score closer to 1 is better, with decreasing values being worse. AUC, AUC-Baseline, precision, recall, and $F1_{max}$ are shown by the blue, orange, green, red, and purple markers, respectively. The type of assessment is noted by the subfigure caption.
Figure 2: KDE separates distinct materials. We show the violin plot for all the $d$ scores separated by chemical groups. The first, second, and third vertical lines within each violin denote the separations between the first, second, third, and fourth quartiles. Values to the left are more likely to be observed compared to values to the right. All violins were forced to have the same width for visual purposes (i.e., the actual number of observations is not reflected by the visual). Green and red violins denote $ID$ and $OD$ groups, respectively. The data set is denoted by the captions.
Figure 3: Absolute residuals grow as out-of-bag (OOB) data becomes increasingly dissimilar. The relationship between $E^{|y-\hat{y}|/MAD_{y}}$ and $d$ for the RF model type is shown. Generally, $E^{|y-\hat{y}|/MAD_{y}}$ increases with an increase in $d$. $E^{|y-\hat{y}|/MAD_{y}}_{c}$ is shown by the horizontal red line, which separates our $OD$ (red) and $ID$ (green) cases. The data set is denoted by the captions.
Figure 4: Lowering Ground Truth Error. Lowering $E^{|y-\hat{y}|/MAD_{y}}_{c}$ has a large impact on how the threshold for $M^{dom}$ is chosen and its corresponding metrics like precision. These data points, identical to those in Fig. \ref{friedman_res}, have been reclassified as $ID/OD$ based on lower ground truth thresholds of 0.25 and 0.01 for Figs. \ref{lower_gt} and \ref{even_lower_gt}, respectively (indicated by the horizontal red line in each figure). $ID$ points are green while $OD$ points are red.
Figure 5: RMSE grows as OOB data becomes increasingly dissimilar. The relationship between $E^{RMSE/\sigma_{y}}$ and $d$ for the RF model type is shown. Generally, $E^{RMSE/\sigma_{y}}$ increases with an increase in $d$. $E^{RMSE/\sigma_{y}}_{c}$ is shown by the horizontal red line, which separates our $OD$ (red) and $ID$ (green) bins. The data set is denoted by the captions.
...and 5 more figures
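
The captions above describe labeling points as $ID$ or $OD$ with a ground-truth error cutoff and then scoring a dissimilarity threshold by precision, recall, and $F1_{max}$. The sketch below shows one way such a threshold could be chosen. It is an illustration under stated assumptions, not the authors' tool: it treats $ID$ as the positive class, scans every observed $d$ value as a candidate cutoff, and uses made-up data. The function name `f1_max_threshold` is hypothetical.

```python
def f1_max_threshold(d_values, errors, error_cutoff):
    """Return (d_cutoff, precision, recall, f1) that maximizes F1 for the ID class.

    Assumptions: a point is ground-truth ID when its error is <= error_cutoff,
    and it is predicted ID when its dissimilarity d is <= d_cutoff.
    """
    is_id = [e <= error_cutoff for e in errors]
    best = (None, 0.0, 0.0, 0.0)
    for cut in sorted(set(d_values)):
        pred_id = [d <= cut for d in d_values]
        tp = sum(p and t for p, t in zip(pred_id, is_id))
        fp = sum(p and not t for p, t in zip(pred_id, is_id))
        fn = sum((not p) and t for p, t in zip(pred_id, is_id))
        if tp == 0:
            continue  # precision and recall are both zero or undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[3]:
            best = (cut, precision, recall, f1)
    return best


# Made-up dissimilarities and normalized errors for illustration only.
d_values = [0.05, 0.10, 0.20, 0.40, 0.70, 0.90, 0.95]
errors = [0.2, 0.3, 0.5, 0.4, 1.5, 2.0, 3.0]
cutoff, precision, recall, f1 = f1_max_threshold(d_values, errors, error_cutoff=1.0)
print(cutoff, precision, recall, f1)
```

Lowering `error_cutoff` shrinks the set of ground-truth $ID$ points, which generally moves the selected $d$ cutoff and changes the resulting precision, in line with the effect described in the Figure 4 caption.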
