A feature selection method based on Shapley values robust to concept shift in regression

Carlos Sebastián, Carlos E. González-Guillén

TL;DR

The paper tackles feature selection under concept shift in regression by introducing SHAPEffects, a backward elimination method that relates the per-prediction SHAP contribution of each feature to the prediction error $err = y - \hat{y}(\mathbf{x})$. Errors are classified via quantiles into correctly predicted, over-predicted, and under-predicted groups, local feature effects are computed, and features with a negative influence are dropped. The resulting models resist degradation under shift while remaining competitive on static data. Across synthetic sudden and incremental shift scenarios and real-world cases such as electricity price forecasting and housing market data, SHAPEffects outperforms or matches state-of-the-art SHAP-based feature selectors and traditional methods, with notable improvements in MAE and stability. The work offers a practical, model-agnostic way to maintain predictive performance in changing environments and outlines future extensions: automating quantile selection and extending the method to classification tasks.
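The mechanism summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `classify_errors` and `harmful_share`, the default quantile levels, and the scoring rule are assumptions made here for illustration, and the SHAP values are taken as given rather than computed from a model. The sketch labels each error $err = y - \hat{y}(\mathbf{x})$ by comparing it with two empirical quantiles of the error distribution, then measures how much of a feature's absolute SHAP mass pushed mispredicted points further in the wrong direction.

```python
from statistics import quantiles


def classify_errors(errors, q_low=0.25, q_high=0.75):
    """Label each err = y - y_hat as 'over', 'under', or 'correct'.

    Assumption: the 'correct' band lies between two empirical quantiles
    of the errors. No bias-dependent shift of the quantiles is applied.
    """
    cuts = quantiles(errors, n=100, method="inclusive")  # 99 percentile cut points
    lo = cuts[round(q_low * 100) - 1]
    hi = cuts[round(q_high * 100) - 1]
    labels = []
    for e in errors:
        if e < lo:
            labels.append("over")     # y well below y_hat: prediction too high
        elif e > hi:
            labels.append("under")    # y well above y_hat: prediction too low
        else:
            labels.append("correct")
    return labels


def harmful_share(shap_column, labels):
    """Hypothetical score: share of |SHAP| that worsened a misprediction.

    A positive SHAP value on an over-predicted point, or a negative one on
    an under-predicted point, moved the prediction away from the target.
    """
    harmful = total = 0.0
    for phi, label in zip(shap_column, labels):
        total += abs(phi)
        if (label == "over" and phi > 0) or (label == "under" and phi < 0):
            harmful += abs(phi)
    return harmful / total if total else 0.0


# Toy data: eight errors and the SHAP values of two features.
errors = [-4.0, -1.0, -0.5, 0.0, 0.2, 0.5, 1.0, 5.0]
labels = classify_errors(errors)
shap_a = [2.0, 1.0, 0.1, 0.0, 0.0, 0.1, -1.0, -2.0]   # pushes errors further out
shap_b = [-2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0]   # pushes errors back in
scores = {"a": harmful_share(shap_a, labels), "b": harmful_share(shap_b, labels)}
```

In a backward elimination loop, a feature such as `a`, whose score is high, would be the natural candidate to remove before refitting. The stopping rule and the exact effect measure are not specified in this summary, so they are left out of the sketch.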

Abstract

Feature selection is one of the most relevant processes in any methodology for creating a statistical learning model. Usually, existing algorithms establish some criterion to select the most influential variables, discarding those that contribute no relevant information to the model. This methodology makes sense in a static situation where the joint distribution of the data does not vary over time. However, when dealing with real data, it is common to encounter dataset shift and, specifically, changes in the relationships between variables (concept shift). In this case, the influence of a variable cannot be the only indicator of its quality as a regressor, since the relationship learned in the training phase may not correspond to the current situation. To tackle this problem, our approach establishes a direct relationship between the Shapley values and the prediction errors, operating at a more local level to detect the individual biases introduced by each variable. The proposed methodology is evaluated on various examples, including synthetic scenarios mimicking sudden and incremental shifts, as well as two real-world cases characterized by concept shift. Additionally, we perform three analyses of standard situations to assess the algorithm's robustness in the absence of shifts. The results demonstrate that our proposed algorithm significantly outperforms state-of-the-art feature selection methods in concept shift scenarios, while matching the performance of existing methodologies in static situations.

Paper Structure (15 sections, 8 equations, 10 figures, 10 tables, 1 algorithm)

Figures (10)

  • Figure 1: (a) The decision frontier learned from the training data. (b) The relationship between the variables has changed, so the decision frontier learned in training is no longer valid for the test data.
  • Figure 2: Left: no translation is necessary because, for the chosen quantiles, the model shows no significant bias. Right: the model tends to overpredict, so this bias is penalised by shifting the quantiles that define the correctly predicted region.
  • Figure 3: Top: a sudden shift situation. Bottom: an incremental shift situation.
  • Figure 4: Histograms (over the 81 cases) of the difference in mean MAE between the proposed method and each of the other algorithms, for the sudden shift case.
  • Figure 5: Histograms (over the 81 cases) of the difference in mean MAE between the proposed method and each of the other algorithms, for the incremental shift case.
  • ...and 5 more figures
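
The sudden and incremental shift situations named in the Figure 3 caption can be simulated with a short sketch. This is one common way to generate such data and not necessarily the construction used in the paper: the function names `coefficient_path` and `simulate`, the single-feature linear model, the linear ramp, and all parameter values are assumptions introduced here for illustration.

```python
import random


def coefficient_path(n_steps, shift_start, beta_before, beta_after,
                     kind="sudden", ramp=1):
    """Coefficient of one feature at each time step under a concept shift.

    kind="sudden": the coefficient jumps to beta_after at shift_start.
    kind="incremental": it drifts linearly to beta_after over `ramp` steps.
    """
    path = []
    for t in range(n_steps):
        if t < shift_start:
            path.append(beta_before)
        elif kind == "sudden":
            path.append(beta_after)
        else:
            frac = min(1.0, (t - shift_start + 1) / ramp)
            path.append(beta_before + frac * (beta_after - beta_before))
    return path


def simulate(path, noise_sd=0.1, seed=0):
    """Draw x_t ~ N(0, 1) and y_t = beta_t * x_t + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in path]
    ys = [b * x + rng.gauss(0.0, noise_sd) for b, x in zip(path, xs)]
    return xs, ys


# The relationship between x and y flips sign abruptly in one series
# and changes gradually in the other.
sudden = coefficient_path(6, 3, 1.0, -1.0, kind="sudden")
incremental = coefficient_path(8, 2, 0.0, 1.0, kind="incremental", ramp=4)
xs, ys = simulate(sudden)
```

A model trained only on the steps before `shift_start` learns `beta_before`, so a feature that was highly influential in training can become a source of systematic error afterwards. That is the situation in which influence alone is a poor selection criterion.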

Theorems & Definitions (2)

  • Definition 1
  • Definition 2