AICO: Feature Significance Tests for Supervised Learning

Kay Giesecke, Enguerrand Horel, Chartsiri Jirachotkulthorn

TL;DR

AICO tackles the opacity of feature influence in supervised learning by turning interpretability into formal statistical inference. It tests whether a feature genuinely improves model performance by masking the feature and measuring the resulting change in a score, yielding exact finite-sample p-values and confidence intervals without retraining or distributional assumptions. The method relies on a model- and distribution-agnostic framework and introduces a uniformly most powerful randomized sign test, with population-level feature importance captured by the median of the feature-effect distribution. Empirical results on synthetic tasks and real-world credit/mortgage datasets show strong statistical power, robustness to feature correlation, and substantial computational efficiency, supporting transparent and trustworthy data-driven decisions. The work provides a practical, rigorous standard for interpretability that scales to modern large models and complex data structures, accompanied by an open-source Python package.

Abstract

Machine learning has become a central tool across scientific, industrial, and policy domains. Algorithms now identify chemical properties, forecast disease risk, screen borrowers, and guide public interventions. Yet this predictive power often comes at the cost of transparency: we rarely know which input features truly drive a model's predictions. Without such understanding, researchers cannot draw reliable scientific conclusions, practitioners cannot ensure fairness or accountability, and policy makers cannot trust or govern model-based decisions. Despite its importance, existing tools for assessing feature influence are limited -- most lack statistical guarantees, and many require costly retraining or surrogate modeling, making them impractical for large modern models. We introduce AICO, a broadly applicable framework that turns model interpretability into an efficient statistical exercise. AICO asks, for any trained regression or classification model, whether each feature genuinely improves model performance. It does so by masking the feature's information and measuring the resulting change in performance. The method delivers exact, finite-sample inference -- exact feature p-values and confidence intervals -- without any retraining, surrogate modeling, or distributional assumptions, making it feasible for today's large-scale algorithms. In both controlled experiments and real applications -- from credit scoring to mortgage-behavior prediction -- AICO consistently pinpoints the variables that drive model behavior, providing a fast and reliable path toward transparent and trustworthy machine learning.
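
The masking step described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function name `feature_effects`, the choice of a fixed `baseline` value for masking, and the negative squared error score are all assumptions made for illustration. It computes, for each test sample, the score with the feature present minus the score with the feature replaced by the baseline, and summarizes the effects by their median.

```python
from statistics import median

def feature_effects(predict, score, X, y, feature, baseline):
    """Per-sample change in score when `feature` is replaced by `baseline`.

    Hypothetical helper: the masking rule and the score are assumptions,
    not the definitions used in the paper.
    """
    effects = []
    for x, target in zip(X, y):
        masked = list(x)
        masked[feature] = baseline  # mask the feature's information
        effects.append(score(predict(x), target) - score(predict(masked), target))
    return effects

# Toy model that uses only feature 0; no retraining is involved.
predict = lambda x: 2 * x[0] + 0 * x[1]
score = lambda pred, target: -(pred - target) ** 2  # higher is better

X = [[1, 5], [2, 7], [3, 9]]
y = [2, 4, 6]

effects_f0 = feature_effects(predict, score, X, y, feature=0, baseline=2.0)
effects_f1 = feature_effects(predict, score, X, y, feature=1, baseline=7.0)

print(effects_f0, median(effects_f0))  # [4.0, 0.0, 4.0] 4.0
print(effects_f1, median(effects_f1))  # feature 1 is ignored by the model
```

In this toy setting the feature the model relies on has a positive median effect, while the ignored feature has effects that are all zero.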

Paper Structure

This paper contains 20 sections, 5 theorems, 38 equations, 7 figures, 10 tables, 1 algorithm.

Key Result

Proposition 3.2

Under Assumption assumption-a, consider the hypotheses for some $M_0\in\mathbb{R}$. A Uniformly Most Powerful test of size $\alpha\in(0,1)$ rejects $H_0$ with probability $\phi_N(n_+(M_0),\alpha)$ where $n_+(x)=\#\{i\in I_2: \Delta_i> x\}$ is the number of feature effect samples in the test set exceeding $x\in \mathbb{R}$ and where $T_{N,\alpha}=q_{1-\alpha}(B_{N, 1/2})$ and
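
The hypotheses and the expression for $\phi_N$ are not reproduced in the statement above. As a rough illustration only, the sketch below implements the textbook randomized one-sided sign test, assuming the hypotheses are $H_0: M\le M_0$ versus $H_1: M>M_0$ for the median $M$ of the feature effect, and assuming that the test rejects outright when $n_+(M_0)>T_{N,\alpha}$ and randomizes when $n_+(M_0)=T_{N,\alpha}$ so that the size is exactly $\alpha$. The function names are hypothetical and the paper's exact formula may differ.

```python
from math import comb

def binom_pmf(k, n):
    """P(B = k) for B ~ Binomial(n, 1/2)."""
    return comb(n, k) / 2**n

def critical_value(n, alpha):
    """(1 - alpha)-quantile of Binomial(n, 1/2): smallest t with P(B <= t) >= 1 - alpha."""
    cdf = 0.0
    for t in range(n + 1):
        cdf += binom_pmf(t, n)
        if cdf >= 1 - alpha:
            return t
    return n

def n_plus(deltas, m0):
    """Number of feature-effect samples strictly exceeding m0."""
    return sum(1 for d in deltas if d > m0)

def rejection_probability(count, n, alpha):
    """Probability of rejecting H0 given `count` exceedances out of n samples.

    Assumed form: reject above the critical value, randomize at it.
    """
    t = critical_value(n, alpha)
    if count > t:
        return 1.0
    if count < t:
        return 0.0
    # At the boundary, randomize so that the overall size equals alpha.
    tail = sum(binom_pmf(k, n) for k in range(t + 1, n + 1))
    return (alpha - tail) / binom_pmf(t, n)

# Ten illustrative effect samples, eight of which exceed M0 = 0.
deltas = [0.4, 1.2, -0.3, 0.8, 0.1, 2.0, -0.5, 0.6, 0.9, 0.2]
count = n_plus(deltas, 0.0)
prob = rejection_probability(count, len(deltas), 0.05)
print(count, round(prob, 4))  # 8 0.8933
```

With $N=10$ and $\alpha=5\%$ the critical value is 8, so eight exceedances fall exactly on the boundary and the test rejects with probability of about 0.893; nine or more exceedances always reject.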

Figures (7)

  • Figure 1: Distribution of randomized $p$-value and decision intervals.
  • Figure 2: Test sample size $N$ vs. power $H_{N,\alpha}$ for $\alpha=5\%$ and several values of $p>1/2$.
  • Figure 3: Rejection probability vs. CI endpoints.
  • Figure 4: Feature importance scores across training/testing experiments.
  • Figure 5: Average computation times for testing/screening all 19 features.
  • ...and 2 more figures

Theorems & Definitions (13)

  • Definition 2.1: Feature Effect
  • Example 2.1
  • Example 2.2
  • Proposition 3.2
  • Proposition 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Proposition 4.4
  • Proof of Proposition \ref{ump}
  • Proof of Proposition \ref{ci-prop}
  • ...and 3 more