Mathematical Theory of Collinearity Effects on Machine Learning Variable Importance Measures

Kelvyn K. Bladen, D. Richard Cutler, Alan Wisler

TL;DR

This paper provides a formal theory for two widely used variable-importance measures, Permute-and-Predict (PaP) and Leave-One-Covariate-Out (LOCO), by embedding them in a latent-variable regression framework. It derives closed-form expressions that link PaP to the coefficient magnitude and predictor variance via $PaP_i = \beta_i \sqrt{2 \operatorname{Var}(\mathbf{x}^v_i)}$ and LOCO to collinearity through $LOCO_i = \beta_i (1-\Delta) \sqrt{1+c}$, with $c$ depending on the collinearity parameter $\Delta$ and dimension $p$. The results explain why PaP is robust to multicollinearity while LOCO is highly sensitive to it, and they connect LOCO to the $t$-statistic through $t_i = LOCO_i \sqrt{\frac{n-1}{\operatorname{Var}(\epsilon)}}$, supported by Monte Carlo simulations and extensions to Random Forests. Finite-sample biases are analyzed and corrected, and the framework is shown to generalize to block-structured data and other interpretability tools. Overall, the work clarifies how data characteristics, coefficients, and covariance structure shape variable-importance measures, enhancing interpretability for practitioners.

Abstract

In many machine learning problems, understanding variable importance is a central concern. Two common approaches are Permute-and-Predict (PaP), which randomly permutes a feature in a validation set, and Leave-One-Covariate-Out (LOCO), which retrains models after permuting a training feature. Both methods deem a variable important if predictions with the original data substantially outperform those with permutations. In linear regression, empirical studies have linked PaP to regression coefficients and LOCO to $t$-statistics, but a formal theory has been lacking. We derive closed-form expressions for both measures, expressed using square-root transformations. PaP is shown to be proportional to the coefficient and predictor variability: $PaP_i = \beta_i \sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}$, while LOCO is proportional to the coefficient but dampened by collinearity (captured by $\Delta$): $LOCO_i = \beta_i (1-\Delta)\sqrt{1+c}$. These derivations explain why PaP is largely unaffected by multicollinearity, whereas LOCO is highly sensitive to it. Monte Carlo simulations confirm these findings across varying levels of collinearity. Although derived for linear regression, we also show that these results provide reasonable approximations for models like Random Forests. Overall, this work establishes a theoretical basis for two widely used importance measures, helping analysts understand how they are affected by the true coefficients, dimension, and covariance structure. This work bridges empirical evidence and theory, enhancing the interpretability and application of variable importance measures.
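The contrast between the two measures can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes each importance is the square root of the increase in validation mean squared error, uses equicorrelated unit-variance Gaussian predictors (a simpler design than the paper's latent-variable framework, with a hypothetical correlation parameter `rho` standing in for collinearity), and implements LOCO by dropping the column before refitting, which is the usual reading of the name. All function names are illustrative.

```python
import numpy as np

def simulate(n, p, rho, beta, rng):
    """Draw equicorrelated unit-variance Gaussian predictors and a linear response."""
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + rng.normal(size=n)
    return X, y

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

def mse(X, y, intercept, coefs):
    return float(np.mean((y - intercept - X @ coefs) ** 2))

def pap(X_tr, y_tr, X_va, y_va, i, rng):
    """Square root of the validation-MSE increase after permuting column i."""
    b0, b = fit_ols(X_tr, y_tr)
    X_perm = X_va.copy()
    X_perm[:, i] = rng.permutation(X_va[:, i])
    gain = mse(X_perm, y_va, b0, b) - mse(X_va, y_va, b0, b)
    return float(np.sqrt(max(gain, 0.0)))

def loco(X_tr, y_tr, X_va, y_va, i):
    """Square root of the validation-MSE increase after refitting without column i."""
    b0, b = fit_ols(X_tr, y_tr)
    keep = [j for j in range(X_tr.shape[1]) if j != i]
    r0, r = fit_ols(X_tr[:, keep], y_tr)
    gain = mse(X_va[:, keep], y_va, r0, r) - mse(X_va, y_va, b0, b)
    return float(np.sqrt(max(gain, 0.0)))

rng = np.random.default_rng(0)
beta = np.ones(3)  # assumed true coefficients
results = {}
for rho in (0.0, 0.9):  # no collinearity vs. strong collinearity
    X_tr, y_tr = simulate(20_000, 3, rho, beta, rng)
    X_va, y_va = simulate(20_000, 3, rho, beta, rng)
    _, b_hat = fit_ols(X_tr, y_tr)
    results[rho] = {
        "pap": pap(X_tr, y_tr, X_va, y_va, 0, rng),
        # Closed form from the abstract, evaluated at the fitted coefficient.
        "pap_formula": float(abs(b_hat[0]) * np.sqrt(2 * X_va[:, 0].var())),
        "loco": loco(X_tr, y_tr, X_va, y_va, 0),
    }
    print(rho, results[rho])
```

In this design the predictor variance is held at 1, so PaP should stay near the closed-form value at both correlation levels, while LOCO should shrink sharply once the other predictors can stand in for the dropped one.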

Paper Structure

This paper contains 20 sections, 5 theorems, 39 equations, 5 figures.

Key Result

Theorem 2.2

Under the latent variable framework in eq:reg_latent and assuming $\beta=\hat{\beta}$, the PaP importance has the closed form $PaP_i = \beta_i \sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}$, where $\operatorname{Var}(\mathbf{x}^v_i)$ is found by substituting a value of $1$ into $\mathbf{J}$ and $\mathbf{I}$ in eq:cov_x. $\blacksquare$
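The factor $\sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}$ in the PaP closed form stated in the abstract can be motivated by a short calculation. This is a sketch under assumptions that are not taken from the paper's own proof: $PaP_i$ is read as the square root of the increase in expected squared validation error, $e$ denotes the validation residual of the fitted linear model, and $\mathbf{x}^{\pi}_i$ (notation introduced here) is the permuted column, treated as an independent copy of $\mathbf{x}^v_i$ that is uncorrelated with $e$.

$$
\begin{aligned}
e_{\text{perm}} &= e + \beta_i\,(\mathbf{x}^v_i - \mathbf{x}^{\pi}_i), \\
\mathbb{E}\!\left[e_{\text{perm}}^2\right] - \mathbb{E}\!\left[e^2\right]
  &= \beta_i^2\,\mathbb{E}\!\left[(\mathbf{x}^v_i - \mathbf{x}^{\pi}_i)^2\right]
   = \beta_i^2 \cdot 2\operatorname{Var}(\mathbf{x}^v_i), \\
\sqrt{\mathbb{E}\!\left[e_{\text{perm}}^2\right] - \mathbb{E}\!\left[e^2\right]}
  &= |\beta_i|\,\sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}.
\end{aligned}
$$

The second line uses the fact that the difference of two independent copies has mean zero and variance $2\operatorname{Var}(\mathbf{x}^v_i)$. For $\beta_i > 0$ the last line matches the stated closed form. Only the marginal variance of predictor $i$ enters, which is consistent with the claim that PaP is largely unaffected by correlation with the other predictors.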

Figures (5)

  • Figure 1: Means with 2-standard-deviation error bars for assessing the relationship between importance values and $\Delta$ (see eq:pap and eq:loco_c_simp) for data defined in ss2:data. Red lines denote theoretical projections. Plots show importance values for raw data and are faceted by $p$ and $n$ dimensions. Plots show that PaP increases with $\Delta$, while LOCO has a nearly linear negative relationship with $\Delta$ (see eq:loco_approx).
  • Figure 1: Parity scatter plots for assessing the relationship between empirical $t$-statistics and empirical LOCO values for data defined in ss2:data. Red lines denote a slope of 1. Plots are faceted by $p$ and $n$ dimensions. We used eq:loco_t to ensure a slope of 1 by transforming the $t$-statistics. We also applied a small sample bias correction from s3:ss_bias to LOCO. No small sample bias correction was applied to the $t$-statistics since this is already accounted for in their calculation.
  • Figure 2: Means with 2-standard-deviation error bars for assessing the relationship between Random Forest importance values and $\Delta$ for data defined in ss2:data. Red lines denote theoretical projections for linear data. Plots show importance values for raw data and are faceted by $p$ and $n$ dimensions. Plot trends appear similar to fig:cov_delta. However, there are clear anomalies regarding bias within the plot, especially for $\Delta = 0$.
  • Figure 3: Means with 2-standard-deviation error bars for assessing the relationship between $c$ values and $\Delta$ (see conj:c) for data defined in ss2:data. Red lines denote theoretical projections. Plots are faceted by $n$ and $p$ dimensions. Plots show increased stability as $n$ increases, $\Delta$ increases, and $p$ decreases. Plots also show sublinear growth converging toward values of $\frac{1}{p-1}$. Plots are free of any clear bias.
  • Figure 4: Means with 2-standard-deviation error bars for comparing empirical importance values to the derived theoretical importances (see eq:pap and eq:loco_c_simp) for data defined in ss2:data. Plots show sample size ($n$) on the x-axis and are faceted by dimensions ($p$), importance metric, and whether a small sample adjustment was applied to the theoretical values. Plots show increased stability as $n$ increases and as $p$ decreases. Plots also show a small sample bias for both raw metrics, but particularly for LOCO. This bias is especially noticeable for small $n$ and large $p$ combinations. Conversely, the adjusted metrics are free of any clear visual bias.

Theorems & Definitions (11)

  • Definition 2.1: $PaP$
  • Theorem 2.2
  • Definition 2.3: $LOCO$
  • Theorem 2.4
  • Conjecture 2.5: $c$
  • Theorem 2.6
  • Definition 2.7: $t$-statistic
  • Theorem 2.8: $t$-statistic as function of $\Delta$
  • Theorem 2.9
  • Proof 1
  • ...and 1 more
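
The link between LOCO and the $t$-statistic named in the list above has an exact in-sample analogue in ordinary least squares, which can be checked numerically. The sketch below is an illustration under assumptions, not the paper's construction: it computes a LOCO-like quantity on the training data itself (square root of the increase in mean squared residual when column $i$ is dropped) and rescales it by $\sqrt{n/\hat{\sigma}^2}$, whereas the paper's relation uses $\sqrt{(n-1)/\operatorname{Var}(\epsilon)}$. The data-generating choices and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))
X[:, 1] += 0.8 * X[:, 0]  # induce collinearity between the first two columns
y = X @ np.array([1.0, 0.5, -0.7, 0.0]) + rng.normal(size=n)

def rss(A, y):
    """Residual sum of squares and coefficients of a least-squares fit."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid), coef

# Full model with intercept, and the classical t-statistics.
A = np.column_stack([np.ones(n), X])
rss_full, coef = rss(A, y)
sigma2_hat = rss_full / (n - p - 1)
cov_beta = sigma2_hat * np.linalg.inv(A.T @ A)
t_stats = coef[1:] / np.sqrt(np.diag(cov_beta)[1:])

# In-sample LOCO-like importance: refit without column i, compare training MSE.
loco_in = []
for i in range(p):
    A_red = np.delete(A, i + 1, axis=1)  # index 0 of A is the intercept
    rss_red, _ = rss(A_red, y)
    loco_in.append(np.sqrt((rss_red - rss_full) / n))
loco_in = np.array(loco_in)

# Rescaling the in-sample LOCO values recovers |t| up to rounding error.
t_from_loco = loco_in * np.sqrt(n / sigma2_hat)
print(np.abs(t_stats))
print(t_from_loco)
```

The agreement follows from the standard identity that the squared $t$-statistic of a single coefficient equals the increase in residual sum of squares from dropping that predictor, divided by $\hat{\sigma}^2$.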