Model-independent variable selection via the rule-based variable priority

Min Lu, Hemant Ishwaran

TL;DR

This work introduces Variable Priority (VarPro), a model-independent variable selection approach that works with rules and requires neither artificial data nor an evaluation of prediction error. It also investigates the asymptotic properties of VarPro and shows, among other things, that VarPro has a consistent filtering property for noise variables.

Abstract

While achieving high prediction accuracy is a fundamental goal in machine learning, an equally important task is finding a small number of features with high explanatory power. One popular selection technique is permutation importance, which assesses a variable's impact by measuring the change in prediction error after permuting the variable. However, this can be problematic due to the need to create artificial data, a problem shared by other methods as well. Another problem is that variable selection methods can be limited by being model-specific. We introduce a new model-independent approach, Variable Priority (VarPro), which works by utilizing rules without the need to generate artificial data or evaluate prediction error. The method is relatively easy to use, requiring only the calculation of sample averages of simple statistics, and can be applied to many data settings, including regression, classification, and survival. We investigate the asymptotic properties of VarPro and show, among other things, that VarPro has a consistent filtering property for noise variables. Empirical studies using synthetic and real-world data show the method achieves a balanced performance and compares favorably to many state-of-the-art procedures currently used for variable selection.
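
To make the permutation importance baseline described above concrete, here is a minimal sketch using a toy linear model. The helper name `permutation_importance`, the squared-error loss, and the simulated data are illustrative assumptions, not taken from the paper. Note that each permuted copy `Xp` is exactly the kind of artificial data the abstract refers to.

```python
import numpy as np

def permutation_importance(predict, X, y, col, n_repeats=20, seed=0):
    """Mean increase in squared error after permuting column `col`.

    Hypothetical helper for illustration; squared error is an assumed loss.
    """
    rng = np.random.default_rng(seed)
    base_error = np.mean((y - predict(X)) ** 2)
    increases = []
    for _ in range(n_repeats):
        Xp = X.copy()
        # Artificial data: the permuted column no longer follows the
        # joint distribution of the features.
        Xp[:, col] = rng.permutation(Xp[:, col])
        increases.append(np.mean((y - predict(Xp)) ** 2) - base_error)
    return float(np.mean(increases))

# Toy data (assumed): only column 0 carries signal.
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)

# Fit ordinary least squares as a stand-in prediction model.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda Z: Z @ beta

scores = [permutation_importance(predict, X, y, j) for j in range(3)]
print(scores)
```

In this toy setting the score for column 0 should be far larger than the scores for the two noise columns, but the calculation depends on a fitted model and on predictions at permuted points.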

Paper Structure (26 sections, 5 theorems, 84 equations, 10 figures, 2 tables, 1 algorithm)

Key Result

Theorem 3.1

Assume that (A1), (A2), (A3) and (A4) hold. If $K_n\le O(\log n)$ and $m_{n,k}\ge m_n=n^{1/2}\gamma_n$ where $\gamma_n\uparrow\infty$ at a rate faster than $\log n$, then $\Delta_n(S) \stackrel{\text{p}}\rightarrow 0$ if $S\subseteq\mathpzc{N}$.
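
As a hedged illustration of the rate conditions only (the specific sequences below are our own choice, not the authors'), one admissible combination is

$$K_n = \lceil \log n \rceil, \qquad \gamma_n = (\log n)^2, \qquad m_n = n^{1/2}(\log n)^2 .$$

Here $\gamma_n/\log n = \log n \rightarrow \infty$, so $\gamma_n$ grows faster than $\log n$, while $m_n/n = (\log n)^2/n^{1/2} \rightarrow 0$, so the lower bound $m_n$ on $m_{n,k}$ is a vanishing fraction of $n$.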

Figures (10)

  • Figure 1: Two-dimensional illustration of how VarPro differs from artificial data methods. (A) The two-dimensional region for $\zeta$ is a rectangle. The data of interest are marked in blue. (B) Permutation variable importance (VIMP) for $X^{(2)}$. The data was permuted along $X^{(2)}$ and data marked in red with triangles identify values that do not match the joint distribution of ${\bf X}$. The model-based predicted values $\tilde{y}:=\tilde{y}(x^{(1)},\tilde{x}^{(2)})$ for these artificial points are extrapolated from a region of the feature space that could be from potentially different responses. (C) VarPro release region for $X^{(2)}$. The original rule is modified to the $S$-released rule $\zeta^S$ (where $S=\{2\}$) shown using a pink background color. (D) VarPro importance score is defined using the estimator calculated using observed data values in blue compared to the estimator where the new released values in red are additionally used. No artificial data needs to be created.
  • Figure 2: Regions $R(\zeta)$ (in blue) for rules $\zeta$ produced by a machine learning procedure. Top left is for an elliptical rule; bottom left is for a hyperplane rule. Middle and right column figures are release region $R(\zeta^S)$ when releasing coordinates $X^{(2)}$ with $S=\{2\}$ (red plus blue region) and $X^{(1)}$ with $S=\{1\}$ (green plus blue region), respectively.
  • Figure 3: Rank of each procedure for correlated regression simulation experiments (lower indicates better performance).
  • Figure 4: Multiclass experiment where variables 1–3 are most informative for class 1, variables 4–6 for class 2 and variables 7–9 for class 3; variables 10–20 are noise variables. Variables 3 and 10, 6 and 15, 9 and 20 are strongly correlated: thus there is correlation between signal and noise features. (A) VarPro importance correctly identifies the group structure and is not influenced by correlation. (B) BC-VIMP from random forests is influenced by correlation that degrades its performance.
  • Figure 5: Feature selection performance over high-dimensional microarray classification datasets. Three simulation models were used for creating synthetic $Y$ classification labels: (A) Linear; (B) Quadratic; (C) Quadratic-Interaction.
  • ...and 5 more figures
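
The release idea in the Figure 1 caption can be sketched for a rectangular rule in a regression setting. This is a rough illustration under stated assumptions, not the authors' implementation: the helpers `in_rule`, `release`, and `release_score` are hypothetical, columns are 0-indexed, and the score used here (absolute difference between the mean response in the rule region and in the released region) is one simple choice made for the example. Only observed data points are used; nothing is permuted or simulated beyond the toy data itself.

```python
import numpy as np

def in_rule(X, bounds):
    """Boolean mask of rows inside a rectangular rule.

    `bounds` maps a column index to a (low, high) interval; columns not
    listed are unconstrained. Hypothetical helper for illustration.
    """
    mask = np.ones(len(X), dtype=bool)
    for col, (low, high) in bounds.items():
        mask &= (X[:, col] >= low) & (X[:, col] <= high)
    return mask

def release(bounds, S):
    """Drop the constraints on the coordinates in S (the S-released rule)."""
    return {col: b for col, b in bounds.items() if col not in S}

def release_score(X, y, bounds, S):
    """Assumed score: |mean of y in the rule - mean of y in the released rule|."""
    inside = in_rule(X, bounds)
    released = in_rule(X, release(bounds, S))
    return float(abs(y[inside].mean() - y[released].mean()))

# Toy data (assumed): column 0 is signal, column 1 is noise.
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(size=(n, 2))
y = 4.0 * X[:, 0] + 0.1 * rng.normal(size=n)

rule = {0: (0.0, 0.3), 1: (0.0, 0.3)}
score_signal = release_score(X, y, rule, S={0})  # release the signal variable
score_noise = release_score(X, y, rule, S={1})   # release the noise variable
print(score_signal, score_noise)
```

In this toy example, releasing the signal coordinate changes the local mean of the response substantially, while releasing the noise coordinate leaves it nearly unchanged, which mirrors the contrast the caption describes.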

Theorems & Definitions (11)

  • Definition 1.1
  • Definition 2.1
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 5.1
  • Lemma A.1
  • proof
  • Lemma B.1
  • proof
  • proof
  • ...and 1 more