A unified framework for evaluating the robustness of machine-learning interpretability for prospect risking

Prithwijit Chowdhury; Ahmad Mustafa; Mohit Prabhushankar; Ghassan AlRegib

TL;DR

We propose a unified framework that generates counterfactuals, quantifies necessity and sufficiency, and uses these measures to evaluate the robustness of the insights provided by LIME and SHAP on high-dimensional structured prospect risking data.

Abstract

In geophysics, hydrocarbon prospect risking involves assessing the risks associated with hydrocarbon exploration by integrating data from various sources. Machine-learning classifiers trained on tabular data have recently been used to make faster decisions on these prospects. The lack of transparency in the decision-making processes of such models has led to the emergence of explainable AI (XAI). LIME and SHAP are two XAI methods that explain a particular decision by ranking the input features in order of importance. However, explanations of the same scenario generated by these two strategies have been shown to disagree, particularly for complex data. This is because the definitions of "importance" and "relevance" differ between explanation strategies. Grounding the ranked features in the theoretically backed causal ideas of necessity and sufficiency can therefore be a more reliable and robust way to improve the trustworthiness of these explanation strategies. We propose a unified framework that generates counterfactuals, quantifies necessity and sufficiency, and uses these measures to evaluate the robustness of the explanations provided by LIME and SHAP on high-dimensional structured prospect risking data. This robustness test gives deeper insight into the models' ability to handle erroneous data and into which XAI method pairs best with which model on our hydrocarbon-indication dataset.

Paper Structure (13 sections, 6 equations, 11 figures, 9 tables)

Introduction
Necessity and Sufficiency
Methodology
Counterfactual generation method
Necessity score
Sufficiency score
Global Feature Importance
Experiments
Robustness evaluation of LIME and SHAPLEY
Conclusion
Appendix
Top $k$ occurrence for LIME Explanations:
Top $k$ occurrence for SHAP Explanations:

Figures (11)

Figure 1: (a) The traditional feature-attribution-based XAI framework for evaluating local feature importance for a particular decision made by a trained model on a data point. (b) Our unified framework, which generates forward counterfactuals and global necessity and sufficiency scores to perform a robustness evaluation (described in Fig x) of given feature attribution (FA) methods (here, LIME and SHAP).
Figure 2: How the necessity and sufficiency calculation for a feature works in a classifier-fitted distribution space. (a) The distribution in the context space $U$. (b) A binary classifier is fitted to this distribution and separates the space by class prediction into $y = y^*$ and $y \neq y^*$. (c) and (d) Diagrammatic representations of the intervention on feature $x_j$, setting it to $a'$ and $a$ respectively, to calculate the conditional probabilities $\alpha$ and $\beta$ from the resulting change in prediction.
Figure 3: Toy example validating that the necessity and sufficiency scores generated by the selected method agree with the logically calculated impact score of a cause leading to an effect.
Figure 4: Robustness analysis of LIME scores: calculating the mean necessity score for rank $2$. (a) Record the names of the features that occupy the same rank across all cases. (b) Match them with their corresponding global necessity scores. (c) The average of the global scores of the features at a particular rank (here $2$) gives the robustness analysis of the LIME explanations of the decisions made by the model on the dataset.
Figure 5: (a) A plot of the well outcomes by DHI Index vs. Initial Pg. (b) The risk factors assessment flow chart: This figure presents the main steps of the Pg calculation for a given prospect.
...and 6 more figures
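The rank-wise averaging described in the Figure 4 caption can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes each explained case yields an ordered list of feature names (most important first), that global necessity scores are available as a dictionary, and that the average is taken over all cases, so a feature that occupies the rank more often contributes more. The function name, feature names, and scores below are hypothetical.

```python
from typing import Dict, List, Sequence


def mean_global_score_at_rank(rankings: Sequence[Sequence[str]],
                              global_scores: Dict[str, float],
                              rank: int) -> float:
    """Average the global scores of the features found at `rank` (1-based)
    across all per-case explanation rankings."""
    # (a) Record which feature occupies the requested rank in each case.
    features_at_rank: List[str] = [
        ranking[rank - 1] for ranking in rankings if len(ranking) >= rank
    ]
    if not features_at_rank:
        raise ValueError(f"no explanation has a feature at rank {rank}")
    # (b) Match each recorded feature with its global score, then
    # (c) average those scores over all cases.
    matched = [global_scores[name] for name in features_at_rank]
    return sum(matched) / len(matched)


# Hypothetical per-case rankings, e.g. from a feature attribution method.
rankings = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["A", "B", "C"],
    ["B", "A", "C"],
]
# Hypothetical global necessity scores per feature.
global_necessity = {"A": 0.75, "B": 0.5, "C": 0.25}

rank2_mean = mean_global_score_at_rank(rankings, global_necessity, rank=2)
print(rank2_mean)
```

Under these assumptions, a higher mean score at the top ranks would suggest that the attribution method tends to place causally necessary features first; the same function can be reused with sufficiency scores.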
