Sampling-guided exploration of active feature selection policies

Gabriel Bernardino, Anders Jonsson, Patrick Clarysse, Nicolas Duchateau

Abstract

Determining the most appropriate features for machine learning predictive models is challenging, as it requires balancing predictive performance against feature acquisition costs. In particular, a single global choice of features is limiting, given that some features only benefit a subset of instances. In previous work, we proposed a reinforcement learning approach that sequentially recommends which modality to acquire next to reach the best information/cost ratio, based on the instance-specific information already acquired. We formulated the problem as a Markov Decision Process in which the state's dimensionality changes during the episode, which avoids data imputation, contrary to existing works. However, this only allowed processing a small number of features, as all possible combinations of features were considered. Here, we address these limitations with two contributions: 1) we expand our framework to larger datasets with a heuristic-based strategy that focuses on the most promising feature combinations, and 2) we introduce a post-fit regularisation strategy that reduces the number of different feature combinations, leading to compact sequences of decisions. We tested our method on four binary classification datasets (one involving high-dimensional variables), the largest of which had 56 features and 4500 samples. We obtained better performance than state-of-the-art methods, both in terms of accuracy and policy complexity.
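The sequential acquisition process summarised in the abstract can be illustrated with a minimal sketch of a single episode. This is not the authors' implementation: the names `run_episode`, `policy`, `classify`, `costs` and `misclassification_cost`, as well as the additive form of the return (negative acquisition costs plus a unit penalty for a wrong prediction), are assumptions made for illustration. The state is simply the dictionary of features acquired so far, so it grows during the episode and no imputation of missing features is needed.

```python
from typing import Callable, Dict, Optional

# All identifiers below are hypothetical; none of them come from the paper.
Action = Optional[str]  # a feature name to acquire, or None to stop and predict


def run_episode(
    instance: Dict[str, float],
    label: int,
    policy: Callable[[Dict[str, float]], Action],
    classify: Callable[[Dict[str, float]], int],
    costs: Dict[str, float],
    misclassification_cost: float = 1.0,  # assumed unit penalty for a wrong label
) -> float:
    """Return the (negative) total cost of one acquisition episode."""
    observed: Dict[str, float] = {}  # state: acquired features only, no imputation
    total = 0.0
    while True:
        action = policy(observed)
        # Stop when the policy says so, repeats a feature, or nothing is left.
        if action is None or action in observed or len(observed) == len(instance):
            break
        observed[action] = instance[action]  # acquire the requested feature
        total -= costs[action]               # pay its acquisition cost
    if classify(observed) != label:
        total -= misclassification_cost      # pay for a wrong final prediction
    return total


# Toy usage with made-up features, costs and decision rules.
costs = {"ecg": 0.01, "mri": 0.05}
patient = {"ecg": 0.2, "mri": 0.9}


def policy(observed: Dict[str, float]) -> Action:
    if "ecg" not in observed:
        return "ecg"                         # always start with the cheap feature
    if observed["ecg"] < 0.5 and "mri" not in observed:
        return "mri"                         # ask for more only in unclear cases
    return None


def classify(observed: Dict[str, float]) -> int:
    # Use the most informative feature available in the current state.
    return int(observed.get("mri", observed.get("ecg", 0.0)) > 0.5)


episode_return = run_episode(patient, 1, policy, classify, costs)
```

In this toy run the policy acquires both features before predicting, so the return is the negated sum of the two acquisition costs. An instance whose first feature is already decisive would stop after one acquisition and pay less, which is the instance-specific behaviour the abstract describes.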

Paper Structure (31 sections, 13 equations, 7 figures, 1 table, 3 algorithms)

Figures (7)

  • Figure 1: Overview of our proposed reinforcement learning approach for active feature selection, in the context of medical data. (a) The decision to use a new acquisition/measurement represents the action at each state, guided by a reward that combines the diagnostic accuracy and the cost associated with acquiring these features, as introduced in our previous work Bernardino2022ReinforcementDiagnosis and consolidated here (Sections \ref{sec:MDP} and \ref{sec:activeFeatureSelection}). (b) Proposed policy exploration based on a heuristic, particularly relevant for large numbers of features (Section \ref{sec:policyExploration}). (c) Proposed regularisation strategy to reduce the total number of different action sequences (Section \ref{sec:policyRegularisation}).
  • Figure 2: Diagnostic accuracy, measured by the AUC, of our method compared to EDDI and Wang's methods, on the four studied datasets (result of acquiring a maximum of 3 features).
  • Figure 3: Modality usage (probability that a modality is acquired, averaged over the full testing population in 10 different train-test splits) in the "Heart" dataset. Features are sorted by increasing cost from left to right; the cost values are summarized in the first plot. When real costs are used (middle column), the algorithm prefers cheaper variables (those on the left of the plot), whereas when every variable has the same cost (right column), more expensive variables are used (those on the right of the plot).
  • Figure 4: Mean accuracy as a function of the mean cost for our method ("sampling", blue) and Wang's ("clustering", red), for different values of the parameter $\lambda$, on the "Heart" dataset. The parameter $\lambda$ steers the balance between cost and accuracy in the reward; each tested value is shown as a point. Our approach allows a smoother transition, in which expensive features are acquired only for difficult cases, whereas Wang's approach shows a more dichotomous behaviour between high and low cost.
  • Figure 5: Evolution of the mean return accuracy (left), mean total cost of all acquired features (center) and policy depth (right) for two exploration strategies: random (blue) and heuristic (orange), on the "Spam" dataset. The upper row depicts an example with a low acquisition cost per modality ($0.01$ for each modality, with the cost of a misclassification set to one unit), while the lower row corresponds to a higher acquisition cost ($0.05$).
  • ...and 2 more figures