Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance

Stanisław Kaźmierczak, Jacek Mańdziuk

TL;DR

This study revisits bootstrap sampling in random forests by exploring bootstrap rates $BR$ beyond the conventional $BR=1.0$, demonstrating that higher $BR$ values frequently improve classification accuracy on diverse datasets. Through extensive experiments on 36 datasets with 18 RF configurations and $BR$ values from $0.2$ to $5.0$, the authors show that the optimal $BR$ is largely dataset-dependent and often exceeds $1.0$, challenging prior conclusions. They connect $BR$ performance to the learned leaf structure via a proximity order framework and corroborate this with Manhattan-distance neighborhood analyses, identifying ratio-based neighborhood statistics that correlate with $BR^{\text{opt}}$. The work also documents practical implications, such as sublinear training-time growth with increasing $BR$ and the need for RF implementations to support $BR > 1.0$, offering a path toward predictive $BR$ selection based on dataset characteristics.

Abstract

Random forests (RFs) utilize bootstrap sampling to generate individual training sets for each component tree by sampling with replacement, with the sample size typically equal to that of the original training set ($N$). Previous research indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is referred to as the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1.0) has been explored only to a limited extent and has generally been considered ineffective. In this paper, we revisit this setup using 36 diverse datasets, evaluating BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that higher BR values can lead to statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1.0). Furthermore, we analyze how BR affects the leaf structure of decision trees within the RF and investigate factors influencing the optimal BR. Our results indicate that the optimal BR is primarily determined by the characteristics of the data set rather than the RF hyperparameters.
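To make the definition of the bootstrap rate concrete, the following is a minimal sketch, not the authors' code, of drawing one bootstrap sample of size $BR \cdot N$ with replacement and measuring how much of the training set it covers. The function names, the training-set size of 1000, and the seed are hypothetical choices made for illustration. The coverage formula $1 - (1 - 1/N)^{BR \cdot N}$ is a standard property of sampling with replacement, not a result taken from the paper.

```python
import math
import random


def bootstrap_indices(n, br, rng):
    """Draw round(br * n) indices from range(n) with replacement."""
    size = max(1, round(br * n))
    return [rng.randrange(n) for _ in range(size)]


def expected_unique_fraction(n, br):
    """Expected share of the n training observations that appear at least
    once in a bootstrap sample of size round(br * n)."""
    size = max(1, round(br * n))
    return 1.0 - (1.0 - 1.0 / n) ** size


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, chosen only for reproducibility
    n = 1000                # hypothetical training-set size

    for br in (0.2, 1.0, 2.0, 5.0):
        sample = bootstrap_indices(n, br, rng)
        observed = len(set(sample)) / n
        expected = expected_unique_fraction(n, br)
        large_n_limit = 1.0 - math.exp(-br)  # approximation for large n
        print(
            f"BR={br:.1f} size={len(sample)} "
            f"unique: observed={observed:.3f} "
            f"expected={expected:.3f} limit={large_n_limit:.3f}"
        )
```

Under these assumptions, a sample drawn with BR $>$ 1.0 contains more distinct training observations than the standard BR $=$ 1.0 sample, and each observation tends to appear with more duplicates. Whether that helps accuracy is the empirical question the paper studies; this sketch only illustrates the sampling step.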

Paper Structure (18 sections, 1 equation, 7 figures, 40 tables)

This paper contains 18 sections, 1 equation, 7 figures, 40 tables.

Figures (7)

  • Figure 1: Distribution of the winning BR across all RF configurations (top left) and among individual RF parameterizations.
  • Figure 1: Characteristics of BR curves for datasets not shown in Figure \ref{fig:BR_characteristics_main}.
  • Figure 2: BR curves for selected datasets.
  • Figure 3: The relationship between the training time of different RF configurations and the size of BR, averaged across all datasets. For each dataset, the times were normalized so that the training time of RF(nt_500) with BR $=$ 1.0 equals 1.
  • Figure 4: An example illustrating how even small differences in the data can significantly affect the optimal BR value. Panels (a) and (b) both show data generated synthetically with scikit-learn's $make\_classification$ function using the following parameters: $n\_samples = 300$, $n\_features = 2$, $n\_classes = 2$, $n\_clusters\_per\_class = 1$, and $random\_state = 1$. The only difference between them is the value of the $class\_sep$ parameter, which controls class separation: 1.95 in (a) and 2.0 in (b). As a result of this slight difference, the optimal BR is 5.0 in (a) but 0.2 in (b). All other parameters of $make\_classification$ remain at their default values.
  • ...and 2 more figures