Re-Visiting Explainable AI Evaluation Metrics to Identify The Most Informative Features

Ahmed M. Salih

TL;DR

This paper re-evaluates proxy-based explainable AI evaluation methods (ROAR and Permutation Importance) and demonstrates their susceptibility to multicollinearity in real and simulated data. It introduces the Expected Accuracy Interval (EAI), a simple formulaic metric that combines current accuracy with a feature-contribution ratio derived from SHAP scores to bound the anticipated accuracy after removing or permuting the top feature. Through experiments on CDC Diabetes Health Indicators and Wine Quality data, as well as synthetic datasets, the authors show that EAI provides meaningful interval predictions and can reveal the shifting importance of features when collinearity is present. The work highlights the practical utility of EAI for interpretability assessment in correlated feature settings while acknowledging dependencies on per-feature scoring and potential limitations when models are perfectly predictive.

Abstract

The functionality or proxy-based approach is one of the approaches used to evaluate the quality of explainable artificial intelligence methods. It uses statistical methods, definitions and newly developed metrics for the evaluation, without human intervention. Among these metrics, Selectivity or RemOve And Retrain (ROAR), and Permutation Importance (PI) are the most commonly used to assess how well explainable artificial intelligence methods highlight the most significant features in machine learning models. Both assume that model performance should drop sharply if the most informative feature is removed from the model or permuted. However, the effectiveness of both metrics is significantly affected by multicollinearity, the number of significant features in the model, and the accuracy of the model. This paper shows with empirical examples that both metrics suffer from these limitations. Accordingly, we propose the expected accuracy interval (EAI), a metric that predicts the upper and lower bounds of the model's accuracy when ROAR or PI is applied. The proposed metric is found to be very useful, especially with collinear features.
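The two procedures named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the paper's experimental setup: the synthetic data, the nearest-centroid classifier, and all function names (`make_data`, `fit_centroids`, `roar_accuracy`, `permutation_accuracy`) are assumptions chosen for illustration. Feature `x1` is built as a near-copy of `x0`, so the sketch shows how a collinear feature can mask the removal of an informative one. The EAI formula itself is not reproduced here.

```python
import random

def make_data(n, rng):
    """Hypothetical data: x0 is informative, x1 is a near-copy of x0, x2 is noise."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x0 = rng.gauss(2.0 * y, 1.0)
        x1 = x0 + rng.gauss(0.0, 0.1)   # collinear with x0
        x2 = rng.gauss(0.0, 1.0)        # uninformative
        rows.append([x0, x1, x2])
        labels.append(y)
    return rows, labels

def fit_centroids(X, y):
    """Nearest-centroid classifier: store the per-class mean of each feature."""
    centroids = {}
    for c in set(y):
        members = [row for row, label in zip(X, y) if label == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def accuracy(centroids, X, y):
    """Fraction of rows whose closest centroid matches the true label."""
    def predict(row):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def drop_column(X, j):
    return [row[:j] + row[j + 1:] for row in X]

def roar_accuracy(X_tr, y_tr, X_te, y_te, j):
    """ROAR: remove feature j, retrain, and score on the reduced test set."""
    model = fit_centroids(drop_column(X_tr, j), y_tr)
    return accuracy(model, drop_column(X_te, j), y_te)

def permutation_accuracy(model, X_te, y_te, j, rng):
    """PI: shuffle feature j in the test set and score the already-trained model."""
    shuffled = [row[j] for row in X_te]
    rng.shuffle(shuffled)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_te, shuffled)]
    return accuracy(model, X_perm, y_te)

rng = random.Random(0)
X_train, y_train = make_data(2000, rng)
X_test, y_test = make_data(2000, rng)

model = fit_centroids(X_train, y_train)
baseline = accuracy(model, X_test, y_test)
roar = roar_accuracy(X_train, y_train, X_test, y_test, 0)
perm = permutation_accuracy(model, X_test, y_test, 0, rng)
print(f"baseline={baseline:.3f}  ROAR(x0 removed)={roar:.3f}  PI(x0 permuted)={perm:.3f}")
```

In this toy setting, retraining without `x0` leaves accuracy almost unchanged because `x1` carries the same signal, so a ROAR-style check alone would understate how informative `x0` is. Permuting `x0` without retraining does lower accuracy here, but only because this particular classifier still relies on the shuffled column. Other models and correlation structures may behave differently.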

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Remove and retrain and permutation importance approaches.
  • Figure 2: Correlation matrix between the features in the Diabetes dataset.
  • Figure 3: Correlation matrix between the features in the Wine quality dataset.
  • Figure 4: Correlation matrix between the features in the simulated dataset to perform binary classification.