Enhancing Variable Importance in Random Forests: A Novel Application of Global Sensitivity Analysis

Giulia Vannucci, Roberta Siciliano, Andrea Saltelli

TL;DR

The paper tackles interpretability in Random Forests by aligning variable importance with the data-generating process via Global Sensitivity Analysis. It proposes RF_GS-VI, a total-sensitivity-index-based variable ranking that uses Saltelli-style sampling and RMSE as the fitness measure to quantify the influence of each predictor. Across three simulated scenarios and two real datasets, RF_GS-VI generally yields correct or more informative rankings than standard VI measures, while highlighting variables that traditional VI may overlook. The work advances Explainable AI by providing a principled, generative view of feature influence and suggests avenues for broader application and theory development.

Abstract

The present work provides an application of Global Sensitivity Analysis to supervised machine learning methods such as Random Forests. These methods act as black boxes, selecting features in high-dimensional data sets so as to provide accurate classifiers in terms of prediction when new data are fed into the system. In supervised machine learning, predictors are generally ranked by importance based on their contribution to the final prediction. Global Sensitivity Analysis is primarily used in mathematical modelling to investigate the effect of the uncertainties of the input variables on the output. We apply it here as a novel way to rank the input features by their importance to the explainability of the data-generating process, shedding light on how the response is determined by the dependence structure of its predictors. A simulation study shows that our proposal can be used to explore what advances can be achieved in terms of efficiency or explanatory ability, or simply by way of confirming existing results.
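To make the idea of a total sensitivity index concrete, here is a minimal sketch of a Saltelli-style sampling design combined with the Jansen estimator of total-order indices. It is not the authors' RF_GS-VI procedure: the function names, the toy model, and the assumption of independent uniform inputs are all illustrative, and the output here is a plain function value rather than the RMSE of a fitted Random Forest.

```python
import numpy as np

def total_sensitivity_indices(model, k, n=4096, seed=0):
    """Jansen estimator of total-order Sobol indices.

    Assumes k independent inputs, each uniform on [0, 1) (an
    illustrative assumption, not taken from the paper).
    """
    rng = np.random.default_rng(seed)
    A = rng.random((n, k))  # first independent sample matrix
    B = rng.random((n, k))  # second independent sample matrix
    y_A = model(A)
    var_y = np.var(np.concatenate([y_A, model(B)]))
    s_t = np.empty(k)
    for i in range(k):
        # Copy of A whose i-th column is taken from B: only input i changes.
        AB_i = A.copy()
        AB_i[:, i] = B[:, i]
        s_t[i] = np.mean((y_A - model(AB_i)) ** 2) / (2.0 * var_y)
    return s_t

def toy_model(X):
    # Hypothetical response: one strong main effect, one interaction,
    # and a fourth input that plays no role at all.
    return 4.0 * X[:, 0] + X[:, 1] * X[:, 2]

s_t = total_sensitivity_indices(toy_model, k=4)
ranking = np.argsort(-s_t)  # input indices, most influential first
```

In this toy setting the first input dominates the ranking, the two interacting inputs receive small but nonzero indices, and the unused input receives an index of zero. A total index captures an input's main effect together with all of its interactions, which is why it is a natural basis for ranking predictors.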

Paper Structure

This paper contains 12 sections, 15 equations, 7 figures, 5 tables, and 1 algorithm.

Figures (7)

  • Figure 1: The Directed Acyclic Graph (DAG) of the recursive regression systems for equations (1), (2) and (3).
  • Figure 2: Violin plots of the Monte Carlo distributions of RF_GS-VI, CART-VI, RF-VI, CF-VI and S_MDA-VI for Scenario 1, $n = 1000$.
  • Figure 3: Violin plots of the Monte Carlo distributions of RF_GS-VI, CART-VI, RF-VI, CF-VI and S_MDA-VI for Scenario 2, $n = 1000$.
  • Figure 4: Violin plots of the Monte Carlo distributions of RF_GS-VI, CART-VI, RF-VI, CF-VI and S_MDA-VI for Scenario 3, $n = 1000$.
  • Figure 5: Energy data set analysis: boxplots of $100$ Monte Carlo replications of the VI measures for $Y_1$: (A) RF_GS-VI; (B) RF-VI; (C) CF-VI; (D) S_MDA-VI.
  • ...and 2 more figures