Aggregate Models, Not Explanations: Improving Feature Importance Estimation

Joseph Paillard, Angel Reyero Lobo, Denis A. Engemann, Bertrand Thirion

TL;DR

The paper tackles instability in model-agnostic feature importance by analyzing how excess risk drives the estimation error of popular variable importance measures (VIMs): LOCO, SAGE, and CFI. It develops a theoretical framework under realistic assumptions and shows that, for LOCO and SAGE, model-level ensembling reduces the leading bias term by lowering the excess risk, whereas CFI's error depends linearly on model deviation and therefore benefits less from ensembling. Empirical validation on benchmarks and a large UK Biobank proteomics study confirms that ensembling at the model level yields more accurate variable rankings and improves target identification. The work offers practical guidance for robust interpretation of biomedical models and clarifies when model-level ensembling is advantageous for feature-importance estimation.

Abstract

Feature-importance methods show promise in transforming machine learning models from predictive engines into tools for scientific discovery. However, due to data sampling and algorithmic stochasticity, expressive models can be unstable, leading to inaccurate variable importance estimates and undermining their utility in critical biomedical applications. Although ensembling offers a solution, deciding whether to explain a single ensemble model or aggregate individual model explanations is difficult due to the nonlinearity of importance measures and remains largely understudied. Our theoretical analysis, developed under assumptions accommodating complex state-of-the-art ML models, reveals that this choice is primarily driven by the model's excess risk. In contrast to prior literature, we show that ensembling at the model level provides more accurate variable-importance estimates, particularly for expressive models, by reducing this leading error term. We validate these findings on classical benchmarks and a large-scale proteomic study from the UK Biobank.
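
A standard identity for squared loss, given here as background rather than as the paper's own derivation, illustrates why explaining the ensemble and averaging sub-model explanations can disagree. The notation is introduced only for this illustration: $f_1, \dots, f_B$ are sub-models and $\bar f = \frac{1}{B}\sum_b f_b$ is their ensemble.

```latex
\left(\bar f(x) - y\right)^2
  = \frac{1}{B}\sum_{b=1}^{B}\left(f_b(x) - y\right)^2
  - \frac{1}{B}\sum_{b=1}^{B}\left(f_b(x) - \bar f(x)\right)^2
```

The last term is non-negative, so the ensemble's risk never exceeds the average sub-model risk, and the gap grows with sub-model disagreement. Because a risk-based importance score is not linear in the model, the importance of the ensemble generally differs from the mean of the sub-model importances.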

Paper Structure

This paper contains 44 sections, 4 theorems, 37 equations, 14 figures.

Key Result

Proposition 4.3

Under the loss-continuity and finite-variance assumptions, we have that

Figures (14)

  • Figure 1: Approximation, estimation, optimization trade-off. To estimate the data-generating process $f_\star$ using a function class $\mathcal{F}$, we first account for the approximation error $\mathcal{E}_\mathrm{app}$ inherent to the best possible function $f^*_{\mathcal{F}}$ in the class. Minimizing empirical risk over a finite training set $\mathcal{D}_n$ introduces estimation error $\mathcal{E}_\mathrm{est}$, resulting in $f_{\mathcal{D}_n}$, which can be reduced via bagging. Finally, the stochasticity of the learning procedure adds optimization error $\mathcal{E}_\mathrm{opt}$, leading to the final estimate $f_{\theta, \mathcal{D}_n }$. This error is mitigated by ensembling over multiple random initializations (voting).
  • Figure 2: Model-level ensembling reduces feature importance estimation error and improves selection. Importance is measured directly on the bagging ensemble (blue) versus averaging individual sub-model importances (orange) on Friedman 1. Model-level ensembling yields both lower MSE, $(\psi_n-\psi_\star)^2$, and higher ROC AUC, indicating more accurate variable ranking and superior feature selection. These gains hold for both LOCO and SAGE across Random Forest (solid) and MLP (dashed) architectures. Error bars represent one standard deviation over 100 random seeds.
  • Figure 3: Bias-variance decomposition of feature importance estimation error. The bar plots show the contributions of the squared bias (dark shades) and the variance (light shades) to the MSE for the ensemble (blue) and sub-model (orange) strategies. Estimation error corresponds to the MSE, $(\psi_n-\psi_\star)^2$ (Equations \ref{eq:error_loco} and \ref{eq:error_sage}). These results were obtained using an MLP model on the Friedman 1 dataset.
  • Figure 4: Ensemble-level importance improves estimation across diverse datasets. Measuring LOCO importance directly on the bagging ensemble (blue) consistently outperforms the average of sub-model scores (orange). Higher $R^2$ scores demonstrate the ensemble's superior predictive performance. This improved prediction translates into more accurate numerical estimates for ranking (lower MSE) and more reliable feature selection (higher ROC AUC). These gains are robust across the Friedman 1, G-function, and Ishigami datasets (rows of the plots), which present varying non-linearities and levels of feature interactions. For all datasets, the number of samples was set to $n=512$ and the estimator is an MLP. Box plots represent results across 100 random seeds.
  • Figure 5: Identification of proteomic signatures for BMI in the UK Biobank. Feature importance ranking of the top 10 proteins identified with LOCO for the prediction of BMI from 2,922 proteins measured in plasma using the Olink platform ($n = 46,382$ participants). The predictive model was an ensemble comprising 10 LightGBM models. Error bars indicate one standard deviation estimated via 5-fold cross-validation. For each fold, the ensemble importance score and the mean importance across individual models are represented by blue and orange circles, respectively.
  • ...and 9 more figures
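
The comparison described in Figures 2 and 4 (LOCO measured on the bagging ensemble versus the mean of sub-model LOCO scores) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's experimental code: the toy data, the 1-nearest-neighbour sub-models, and every function name (`make_data`, `fit_1nn`, `fit_bag`, `loco_both`) are hypothetical choices made to keep the example dependency-free, whereas the paper uses Random Forest, MLP, and LightGBM models.

```python
import random
from statistics import fmean

def make_data(n, rng):
    """Toy regression (assumption): y depends on x0 and x1; x2 is pure noise."""
    X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n)]
    y = [2.0 * x[0] + x[1] ** 2 + rng.gauss(0, 0.1) for x in X]
    return X, y

def fit_1nn(X, y, cols):
    """Return a 1-nearest-neighbour predictor restricted to the features in `cols`."""
    train = [([x[c] for c in cols], t) for x, t in zip(X, y)]
    def predict(x):
        q = [x[c] for c in cols]
        nearest = min(train, key=lambda pt: sum((a - b) ** 2 for a, b in zip(pt[0], q)))
        return nearest[1]
    return predict

def mse(preds, y):
    return fmean((p - t) ** 2 for p, t in zip(preds, y))

def fit_bag(X, y, cols, n_models, seed):
    """Fit sub-models on bootstrap resamples; the same seed gives the same resamples."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_1nn([X[i] for i in idx], [y[i] for i in idx], cols))
    return models

def predict_all(models, X):
    """One row of test predictions per sub-model."""
    return [[m(x) for x in X] for m in models]

def ensemble_mean(pred_rows):
    """Average the sub-model predictions point by point."""
    return [fmean(col) for col in zip(*pred_rows)]

def loco_both(X_tr, y_tr, X_te, y_te, n_models=8, seed=0):
    """LOCO with refitting, computed at the model level and at the explanation level."""
    n_features = len(X_tr[0])
    full = predict_all(fit_bag(X_tr, y_tr, list(range(n_features)), n_models, seed), X_te)
    risk_ens = mse(ensemble_mean(full), y_te)
    risk_sub = [mse(row, y_te) for row in full]
    model_level, explanation_level = [], []
    for j in range(n_features):
        cols = [c for c in range(n_features) if c != j]
        red = predict_all(fit_bag(X_tr, y_tr, cols, n_models, seed), X_te)
        # Model level: importance of the ensemble itself.
        model_level.append(mse(ensemble_mean(red), y_te) - risk_ens)
        # Explanation level: mean of the per-sub-model importances.
        explanation_level.append(fmean(mse(r, y_te) - s for r, s in zip(red, risk_sub)))
    return model_level, explanation_level, risk_ens, fmean(risk_sub)

rng = random.Random(0)
X_tr, y_tr = make_data(150, rng)
X_te, y_te = make_data(100, rng)
model_level, explanation_level, risk_ens, risk_sub_mean = loco_both(X_tr, y_tr, X_te, y_te)
for j, (a, b) in enumerate(zip(model_level, explanation_level)):
    print(f"feature {j}: ensemble LOCO = {a:.3f}, mean sub-model LOCO = {b:.3f}")
print(f"test MSE: ensemble = {risk_ens:.3f}, mean sub-model = {risk_sub_mean:.3f}")
```

With squared loss, the ensemble's test risk cannot exceed the mean sub-model risk, so the last printed line always shows the ensemble at or below the sub-model average. Which of the two LOCO estimates lies closer to the true importance depends on the model and data, which is the question the paper's analysis addresses.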

Theorems & Definitions (9)

  • Proposition 4.3: Bayes risk estimation error
  • Proof sketch
  • Remark 4.4
  • Remark 4.5
  • Theorem 4.6: Excess risk dominates importance estimation with LOCO
  • Proposition 4.7: Impact of ensembling on importance estimation
  • Proof sketch
  • Theorem 4.10: Excess risk dominates estimation error with SAGE
  • Proof