The Rashomon Importance Distribution: Getting RID of Unstable, Single Model-based Variable Importance

Jon Donnelly, Srikar Katta, Cynthia Rudin, Edward P. Browne

TL;DR

This work tackles the problem that many models can explain observational data equally well, which undermines trustworthy variable-importance conclusions. It introduces the Rashomon Importance Distribution (RID), a model-class-agnostic framework that quantifies variable importance across all good models within the Rashomon set $\mathcal{R}^{\varepsilon}_{\mathcal{D}^{(n)}}$ and across bootstrap perturbations, yielding a distribution rather than a single metric. The authors define RID via $\text{RID}_j(k; \varepsilon, \mathcal{F}, \ell, \mathcal{P}_n, \lambda)$ and establish estimators $\widehat{\text{RID}}_j$ with finite-sample guarantees under a Lipschitz-type assumption relating the data-generating process to the Rashomon distribution. Empirically, RID outperforms baselines on synthetic data generation processes and reveals novel biological insights in an HIV-related case study, notably highlighting LINC00486 as a stable, high-impact variable. Overall, RID enhances reproducibility and reliability of variable-importance assessments by incorporating both the Rashomon effect and data perturbations across diverse model classes.

Abstract

Quantifying variable importance is essential for answering high-stakes questions in fields like genetics, public policy, and medicine. Current methods generally calculate variable importance for a given model trained on a given dataset. However, for a given dataset, there may be many models that explain the target outcome equally well; without accounting for all possible explanations, different researchers may arrive at many conflicting yet equally valid conclusions given the same data. Additionally, even when accounting for all possible explanations for a given dataset, these insights may not generalize because not all good explanations are stable across reasonable data perturbations. We propose a new variable importance framework that quantifies the importance of a variable across the set of all good models and is stable across the data distribution. Our framework is extremely flexible and can be integrated with most existing model classes and global variable importance metrics. We demonstrate through experiments that our framework recovers variable importance rankings for complex simulation setups where other methods fail. Further, we show that our framework accurately estimates the true importance of a variable for the underlying data distribution. We provide theoretical guarantees on the consistency and finite sample error rates for our estimator. Finally, we demonstrate its utility with a real-world case study exploring which genes are important for predicting HIV load in persons with HIV, highlighting an important gene that has not previously been studied in connection with HIV. Code is available at https://github.com/jdonnelly36/Rashomon_Importance_Distribution.

Paper Structure (23 sections, 11 theorems, 76 equations, 17 figures, 3 tables)


Key Result

Theorem 1

Let the Lipschitz-type assumption (asm:lipschitz) hold for a distributional distance $\rho(A_1, A_2)$ between distributions $A_1$ and $A_2$. For any $t > 0$ and $j \in \{0, \ldots, p\}$, as $\rho\left(LD^*(\cdot; \ell, n, \lambda), \textit{RLD}(\cdot; \varepsilon, \mathcal{F}, \ell, \mathcal{P}_n, \lambda) \right) \to 0$ and ...

Figures (17)

  • Figure 1: Statistics of Rashomon sets computed across 500 bootstrap replicates of a given dataset sampled from the Monk 3 data generation process (Thrun et al., 1991). The original dataset consisted of 124 observations, and the Rashomon set was calculated using its definition in the Rashomon set equation (eqn:Rset), with parameters specified in Section D of the supplement. The Rashomon set size is the number of models with loss below a threshold. Model reliance is a measure of variable importance for a single variable (in this case, $X_2$), and Model Class Reliance (MCR) is its range over the Rashomon set. Both the Rashomon set size and model class reliance are unstable across bootstrap iterations.
  • Figure 2: An overview of our framework. Step 1: We bootstrap multiple datasets from the original. Step 2: We show the loss values over the model class for each bootstrapped dataset, differentiated by color. The dotted line marks the Rashomon threshold; all models whose loss is under the threshold are in the Rashomon set for that bootstrapped dataset. On top, we highlight the number of bootstrapped datasets for which the corresponding model is in the Rashomon set. Step 3: We then compute the distribution of model reliance (variable importance -- VI) values for variable $j$ across the Rashomon set for each bootstrapped dataset. Step 4: We then average the corresponding CDF across bootstrap replicates into a single CDF (in purple). Step 5: Using the CDF, we compute the marginal distribution (PDF) of variable importance for variable $j$ across the Rashomon sets of bootstrapped datasets.
  • Figure 3: (Top) The proportion of features ranked correctly by each method on each dataset, shown as a stacked bar plot. The figures are ordered by method performance across the four simulation setups. (Bottom) The proportion of independent DGP $\phi^{(sub)}$ calculations on 500 new datasets from the DGP that were contained within the box-and-whiskers range computed using a single training set (with bootstrapping in all methods except VIC) for each method and variable in each simulation. Underneath each method's label, the first row shows the percentage of times, across all 500 independently generated datasets and variables, that the DGP's variable importance fell inside that method's box-and-whiskers interval. The second row shows the percentage of pairwise rankings correct for each method (from the top plot). Higher is better.
  • Figure 4: We generate 50 independent datasets from Chen's DGP and calculate MCR, box-and-whiskers ranges (BWRs) for VIC, and BWRs for RID. The plot shows the interval for each dataset for each non-null variable in Chen's DGP. Each red interval fails to overlap with at least one of the other 49 intervals.
  • Figure 5: Median Jaccard similarity scores across 50 independently generated MCR, VIC, and RID box-and-whiskers ranges for each DGP; 1 is perfect similarity. Error bars show the 95% confidence interval around the median.
  • ...and 12 more figures
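
The five steps in the Figure 2 caption can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' implementation: the model class (linear threshold rules with weights in {-1, 0, 1}), the 0-1 loss, the permutation-based model reliance, and all names (`rid_cdf`, `model_reliance`, `eps`, `n_boot`) are hypothetical choices made so the example stays small and self-contained.

```python
import itertools
import numpy as np

def zero_one_loss(w, X, y):
    # Misclassification rate of the rule "predict 1 when X @ w > 0".
    return float(np.mean((X @ w > 0).astype(int) != y))

def model_reliance(w, X, y, j, rng):
    # Assumed importance metric: loss increase after permuting column j.
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return zero_one_loss(w, Xp, y) - zero_one_loss(w, X, y)

def rid_cdf(X, y, j, k_grid, eps=0.05, n_boot=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Toy finite model class (an assumption): all weight vectors in {-1,0,1}^p.
    models = [np.array(w) for w in itertools.product([-1, 0, 1], repeat=p)]
    cdf = np.zeros(len(k_grid))
    for _ in range(n_boot):
        # Step 1: bootstrap a dataset from the original.
        idx = rng.integers(0, n, size=n)
        Xb, yb = X[idx], y[idx]
        # Step 2: Rashomon set = models within eps of the best loss.
        losses = np.array([zero_one_loss(w, Xb, yb) for w in models])
        in_set = losses <= losses.min() + eps
        # Step 3: model reliance of variable j for each model in the set.
        mr = np.array([model_reliance(w, Xb, yb, j, rng)
                       for w, keep in zip(models, in_set) if keep])
        # Empirical CDF of model reliance over this Rashomon set.
        cdf += np.array([(mr <= k).mean() for k in k_grid])
    # Step 4: average the per-bootstrap CDFs into a single CDF.
    return cdf / n_boot

# Synthetic data: x0 drives y, x1 is a noisy copy of x0, x2 is pure noise.
rng = np.random.default_rng(1)
x0 = rng.normal(size=200)
X = np.column_stack([x0, x0 + 0.1 * rng.normal(size=200), rng.normal(size=200)])
y = (x0 > 0).astype(int)
k_grid = np.linspace(-1.0, 1.0, 21)
cdf_signal = rid_cdf(X, y, j=0, k_grid=k_grid)
cdf_noise = rid_cdf(X, y, j=2, k_grid=k_grid)
# Step 5: differences of the averaged CDF give the importance mass per bin.
pmf_signal = np.diff(cdf_signal)
```

Because `x1` is nearly a copy of `x0`, several models with different reliance on `x0` land in the Rashomon set, so `cdf_signal` spreads its mass over a range of importance values instead of reporting one number. Comparing `cdf_signal` with `cdf_noise` shows how the distribution separates a variable that good models use from one they tend to ignore.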

Theorems & Definitions (20)

  • Theorem 1
  • Theorem 2
  • Theorem A.1
  • proof
  • Theorem A.2
  • proof
  • Corollary 1
  • proof
  • Corollary 2
  • proof
  • ...and 10 more