Marginal and Conditional Importance Measures from Machine Learning Models and Their Relationship with Conditional Average Treatment Effect

Mohammad Kaviul Anam Khan, Olli Saarela, Rafal Kustra

TL;DR

This work tackles the challenge of interpreting black-box predictors by introducing MVIM, a model-agnostic metric based on the true conditional expectation $f_0$, and CVIM, a conditional permutation-based counterpart designed to mitigate correlation bias. MVIM can be expressed as a quadratic function of the conditional average treatment effect (CATE) for multinomial and continuous treatments, linking prediction importance to causal structure. Its estimator, however, is biased when predictors are correlated, because the fitted model must extrapolate into low-density regions. The authors develop a bias-variance decomposition, introduce a delta term, and show that CVIM (and the adjusted AMVIM) reduce sensitivity to predictor correlations and signal near-positivity violations, with CVIM converging faster than MVIM in simulations. Collectively, MVIM, CVIM, and AMVIM provide a model-agnostic, causally interpretable suite of importance measures applicable to binary, multinomial, and continuous treatments, offering practical guidance for robust variable importance under correlation and causality concerns.

Abstract

Interpreting black-box machine learning models is challenging due to their strong dependence on data and inherently non-parametric nature. This paper reintroduces the concept of importance through the "Marginal Variable Importance Metric" (MVIM), a model-agnostic measure of predictor importance based on the true conditional expectation function. MVIM evaluates predictors' influence on continuous or discrete outcomes. A permutation-based estimation approach, inspired by Breiman (2001) and Fisher et al. (2019), is proposed to estimate MVIM. The MVIM estimator is biased when predictors are highly correlated, as black-box models struggle to extrapolate in low-probability regions. To address this, we investigate the bias-variance decomposition of MVIM to understand the source and pattern of the bias under high correlation. A Conditional Variable Importance Metric (CVIM), adapted from Strobl et al. (2008), is introduced to reduce this bias. Both MVIM and CVIM exhibit a quadratic relationship with the conditional average treatment effect (CATE).
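
The permutation-based estimation idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `permutation_importance`, the toy data-generating process, and the choice to measure importance as the excess mean squared error after permuting one column are all assumptions made for illustration, since this summary does not reproduce the formal MVIM definition. In this toy setup, with a binary treatment that is independent of the other covariate and an oracle predictor, the excess loss should be close to $2p(1-p)\tau^2$, which is quadratic in the treatment effect $\tau$.

```python
import numpy as np


def permutation_importance(predict, X, y, j, n_repeats=20, rng=None):
    """Average excess mean squared error after marginally permuting column j.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = np.random.default_rng(rng)
    base_loss = np.mean((y - predict(X)) ** 2)
    excess = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # Break the link between column j and the outcome by shuffling it.
        X_perm[:, j] = rng.permutation(X[:, j])
        excess.append(np.mean((y - predict(X_perm)) ** 2) - base_loss)
    return float(np.mean(excess))


# Toy data (assumed): binary treatment x independent of covariate z.
rng = np.random.default_rng(0)
n, p, tau = 20_000, 0.3, 2.0
x = rng.binomial(1, p, size=n).astype(float)
z = rng.normal(size=n)
X = np.column_stack([x, z])
y = tau * x + z + rng.normal(scale=0.5, size=n)


def oracle(X):
    """True conditional expectation of y for the toy data above."""
    return tau * X[:, 0] + X[:, 1]


estimate = permutation_importance(oracle, X, y, j=0, rng=1)
reference = 2 * p * (1 - p) * tau ** 2  # quadratic in the treatment effect
print(f"estimated importance: {estimate:.3f}, reference: {reference:.3f}")
```

Using the oracle predictor isolates the permutation step from model-fitting error. Replacing `oracle` with a fitted model, and making `x` and `z` correlated, is the setting in which the abstract reports bias from extrapolation.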

Paper Structure

This paper contains 24 sections, 5 theorems, 34 equations, 12 figures, 7 tables, 2 algorithms.

Key Result

Theorem 1

Let the treatment be multinomial with $K$ categories ($X \in \{1, 2, \ldots, K\}$), where the marginal probability of a specific category $X = x$ is $P(X = x) = p_{x}$. With respect to the true conditional expectation $f_{0}$, the MVIM in Equation gvim can be rewritten as

Figures (12)

  • Figure 1: Estimated MVIM for the predictors $X_1$ and $X_5$ in the completely independent scenario. Panel 1 shows box plots of the MVIMs from the oracle model, panel 2 shows box plots from the XGBoost model, and panel 3 shows box plots from the GAM model.
  • Figure 2: Estimated MVIM for the predictors $X_1$ and $X_5$ in the simple correlation scenario. Panel 1 shows box plots of the MVIMs from the oracle model, panel 2 shows box plots from the XGBoost model, and panel 3 shows box plots from the GAM model.
  • Figure 3: Estimated MVIM for the predictors $X_1$ and $X_5$ in the multivariate correlation scenario. Panel 1 shows box plots of the MVIMs from the oracle model, panel 2 shows box plots from the XGBoost model, and panel 3 shows box plots from the GAM model.
  • Figure 4: The trajectories of $\delta_j$ and $\widehat{\mathcal{MI}}_j$ for $X_1$ and $X_5$ with increasing training size in the completely independent scenario.
  • Figure 5: The trajectories of $\delta_1$ and $\widehat{\mathcal{MI}}$ for $X_1$ with increasing training size in the simple correlation scenario.
  • ...and 7 more figures

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5