Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles

Luqin Gan; Lili Zheng; Genevera I. Allen

TL;DR

This work develops LOCO-MP, a model-agnostic, distribution-free framework for inferring feature importance by leveraging minipatch ensembles to perform leave-one-covariate-out evaluations without data splitting or refitting on the full dataset. It establishes asymptotic coverage for the feature-importance score $\Delta_j$ under mild assumptions and introduces a variance-barrier safeguard to handle vanishing-variance scenarios, while also enabling distribution-free predictive inference via Jackknife+-minipatch conformal methods. The approach demonstrates favorable statistical power, computational efficiency, and robust performance in simulations and real benchmark data, including correlated-feature settings where traditional occlusion methods struggle. Together, the method provides fast, reliable uncertainty quantification for both predictions and feature importance across tabular regression and classification tasks, with broad applicability to complex scientific datasets.

Abstract

To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
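The idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the k-nearest-neighbour base learner, the absolute-error loss, the normal-approximation interval, and all names and default parameters (`knn_predict`, `loco_mp_intervals`, `K`, `n`, `m`) are assumptions chosen for illustration. Each minipatch is a random subset of observations and features; for every observation, the sketch compares the ensemble prediction from minipatches that left that observation out with the prediction from those that also left feature `j` out, so no model is ever refit and no data is split.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def knn_predict(X, y, rows, feats, x, k=3):
    """Hypothetical base learner: k-NN regression fit on one minipatch."""
    nearest = sorted(rows, key=lambda r: sum((X[r][f] - x[f]) ** 2 for f in feats))
    return mean(y[r] for r in nearest[:k])


def loco_mp_intervals(X, y, K=200, n=20, m=2, alpha=0.1, seed=0):
    """Sketch of minipatch-based LOCO intervals (illustrative assumptions only)."""
    rng = random.Random(seed)
    N, M = len(X), len(X[0])
    # Each minipatch: n random observations and m random features.
    patches = [(set(rng.sample(range(N), n)), rng.sample(range(M), m))
               for _ in range(K)]
    # Predict observation i only from minipatches that did not train on i.
    preds = [
        {i: knn_predict(X, y, rows, feats, X[i]) for i in range(N) if i not in rows}
        for rows, feats in patches
    ]
    z = NormalDist().inv_cdf(1 - alpha / 2)
    intervals = {}
    for j in range(M):
        deltas = []
        for i in range(N):
            # Leave-one-observation-out ensemble prediction.
            loo = [preds[k][i] for k in range(K) if i in preds[k]]
            # Same, restricted to minipatches that also exclude feature j.
            loco = [preds[k][i] for k in range(K)
                    if i in preds[k] and j not in patches[k][1]]
            if not loo or not loco:
                continue
            # Increase in absolute error when feature j is unavailable.
            deltas.append(abs(y[i] - mean(loco)) - abs(y[i] - mean(loo)))
        center = mean(deltas)
        half = z * stdev(deltas) / math.sqrt(len(deltas))
        intervals[j] = (center - half, center + half)
    return intervals


# Toy data (assumed): only feature 0 carries signal.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1) for _ in range(5)] for _ in range(100)]
y = [3 * row[0] + data_rng.gauss(0, 0.1) for row in X]
intervals = loco_mp_intervals(X, y)
for j, (lo, hi) in intervals.items():
    print(f"feature {j}: [{lo:.3f}, {hi:.3f}]")
```

In this toy setting, the interval for feature 0 should sit above zero, since removing it increases prediction error. The sketch omits the paper's safeguard for vanishing variance and any multiplicity adjustment.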

Paper Structure (55 sections, 13 theorems, 115 equations, 14 figures, 7 tables, 3 algorithms)

Introduction
Organization
Model-agnostic Inference for Feature Importance
Prior method: LOCO-Split
Leveraging an ensemble learning framework: minipatch learning
Target of Inference: Feature Importance Score
Fast Feature Importance Inference
Coverage Guarantees for Feature Importance Confidence Intervals
Notation
Guarantees for Feature Importance Inference
A Strategy for Handling Vanishing Variance
Discussion on Our Feature Importance Inference
Challenges Associated with Our Inference Approach
Illustration via Linear Models
Comparison with inference targets in prior works
...and 40 more sections

Key Result

Theorem 1

Suppose that all training data $(X_i, Y_i)\overset{\text{i.i.d.}}{\sim}\mathcal{P}$ and that Assumptions assump:Lip through assump:mpnumber hold. If the sequence of random variables $\{[h_j(X_i,Y_i)-\mathbb{E}(h_j(X_i,Y_i))]^2/\sigma_j^2\}_{i=1}^N$ is uniformly integrable, then the asymptotic distribution result for $\bar{\Delta}_j$ holds (the display equation is not reproduced here), where $\sigma_j^2 = \mathrm{Var}_{(X,Y)\sim \mathcal{P}}(h_j(X,Y))$ with $h_j(\cdot,\cdot)$ being defined in eq:h_j_abbrv_def.

Figures (14)

Figure 1: Histograms of $\Delta_j$ for a noise feature $j$, computed from simulated regression and classification data sets, under linear and non-linear data generating models and base learners. The dashed line is the Monte Carlo approximation for $\Delta_j^*$, and we can see that $\Delta_j$ is extremely close to $\Delta_j^*$, which is always negative except for the correlated feature settings. More details on the simulation set-ups and more histograms are included in the Appendix.
Figure 2: Coverage of 90% confidence intervals for the inference target (eq:target_inference) of a null feature in synthetic regression data, using ridge, decision tree, and kernel SVM as the base estimators. LOCO-MP has valid coverage near 0.9.
Figure 3: Width of 90% confidence intervals for the inference target (eq:target_inference) of a null feature in synthetic regression data, using ridge, decision tree, and kernel SVM as the base estimators. LOCO-MP has small interval width that decreases as the sample size $N$ increases.
Figure 4: Inference power on synthetic regression and classification data with $N=500$, $M = 200$. LOCO-MP can achieve type 1 error control under the uncorrelated linear setting (with power < 0.1 when SNR = 0), and is either the best or among the top performers in terms of statistical power across all scenarios.
Figure 5: Feature inference (multiplicity adjusted) on the Wine quality regression data and African Heart Disease classification data, using the random forest as the base predictor. Features whose confidence-interval lower bounds are greater than zero are statistically significant. A confidence interval with an upper bound smaller than zero indicates that the feature would hamper prediction. LOCO-MP identifies important features that are consistent with previous studies on each data set and is among the best in terms of interval efficiency, with the smallest widths.
...and 9 more figures

Theorems & Definitions (21)

Theorem 1: Asymptotic distribution of $\bar{\Delta}_j$
Theorem 2: Consistent variance estimate
Corollary 1: Coverage and width of $\hat{\mathbb{C}}_j$
Theorem 3
Theorem 4
Theorem 5
Theorem 6: Valid inference for $\Delta_j^*$
Theorem 7: Distribution-free Predictive Inference Guarantee
Theorem 8
Theorem 9
...and 11 more
