Effect of hyperparameters on variable selection in random forests

Cesaire J. K. Fouodo, Lea L. Kronziel, Inke R. König, Silke Szymczak

TL;DR

The paper investigates how random forest (RF) hyperparameters affect the RF-based variable selection methods Vita and Boruta in high-dimensional data. Using two simulation studies (one with a simple block-structured correlation and one with correlations derived from empirical gene expression data), the authors assess how the settings of $mtry.prop$, $sample.fraction$, $min.node.size.prop$, and $replace$ influence FDR, sensitivity, and stability. They find that $mtry.prop$ and $sample.fraction$ often have a larger impact on variable selection performance than $min.node.size.prop$ or $replace$, with optimal values depending on the correlation structure; the default values are not universally optimal for selection tasks. The work provides practical guidance: tune hyperparameters with the data structure in mind and consider a smaller $sample.fraction$ for weakly correlated data (where the default $mtry.prop$ performs well), while keeping sampling with replacement as a robust default. Overall, the study clarifies when RF defaults aid or hinder variable selection and informs study design for omics analyses where identifying relevant predictors is critical.

Abstract

Random forests (RFs) are well suited for prediction modeling and variable selection in high-dimensional omics studies. The effects of hyperparameters of the RF algorithm on prediction performance and variable importance estimation have previously been investigated. However, how hyperparameters impact RF-based variable selection remains unclear. We evaluate the effects on the Vita and the Boruta variable selection procedures based on two simulation studies utilizing theoretical distributions and empirical gene expression data. We assess the ability of the procedures to select important variables (sensitivity) while controlling the false discovery rate (FDR). Our results show that the proportion of splitting candidate variables and the sample fraction for the training dataset influence the selection procedures more than the drawing strategy of the training datasets and the minimal terminal node size. A suitable setting of the RF hyperparameters depends on the correlation structure in the data. For weakly correlated predictor variables, the default value of the number of splitting variables is optimal, but smaller values of the sample fraction result in larger sensitivity. In contrast, the difference in sensitivity between the optimal and the default value of the sample fraction is negligible for strongly correlated predictor variables, whereas smaller values than the default are better in the other settings. In conclusion, the default values of the hyperparameters will not always be suitable for identifying important variables. Thus, adequate values differ depending on whether the aim of the study is optimizing prediction performance or variable selection.
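
As an illustration of the two evaluation criteria named in the abstract, the following minimal sketch computes empirical sensitivity and FDR for a single selection result. It uses the standard definitions (sensitivity as the share of truly important variables that were selected, FDR as the share of selected variables that are not truly important); the function name `selection_metrics`, the variable names, and the convention of reporting an FDR of 0 when nothing is selected are assumptions made for this example and are not taken from the paper.

```python
from typing import Iterable, Tuple


def selection_metrics(
    selected: Iterable[str], truly_important: Iterable[str]
) -> Tuple[float, float]:
    """Return (sensitivity, fdr) for one variable selection result.

    Hypothetical helper for illustration; not the authors' code.
    """
    selected_set = set(selected)
    important_set = set(truly_important)

    true_positives = len(selected_set & important_set)
    false_positives = len(selected_set - important_set)

    # Sensitivity: fraction of truly important variables that were selected.
    sensitivity = true_positives / len(important_set) if important_set else 0.0

    # FDR: fraction of selected variables that are not truly important.
    # Assumed convention: 0 when no variable is selected.
    fdr = false_positives / len(selected_set) if selected_set else 0.0

    return sensitivity, fdr


# Made-up example: 4 truly important variables, 3 selected, 2 of them correct.
sens, fdr = selection_metrics(["g1", "g2", "g7"], ["g1", "g2", "g3", "g4"])
print(f"sensitivity={sens:.2f}, FDR={fdr:.2f}")
```

In a simulation study of this kind, such per-replicate values would typically be averaged over all replicates for each hyperparameter setting.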

Paper Structure (24 sections, 2 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Simulation study 2: Empirical performances of Vita and Boruta for variations of mtry.prop and sample.fraction are shown in the first and second rows, respectively. Each method's average across all replicates is shown for each hyperparameter variation.
  • Figure 2: Simulation study 2: The first row shows the empirical performances of Vita and Boruta for variations of min.node.size.prop, and the second row shows their empirical performances for variations of replace. Axes are scaled differently.