Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance

Stanisław Kaźmierczak; Jacek Mańdziuk

TL;DR

This study revisits bootstrap sampling in random forests by exploring bootstrap rates $BR$ beyond the conventional $BR=1.0$, demonstrating that higher $BR$ values frequently improve classification accuracy on diverse datasets. Through extensive experiments on 36 datasets with 18 RF configurations and BR values from $0.2$ to $5.0$, the authors show that the optimal $BR$ is largely dataset-dependent and often exceeds $1.0$, challenging prior conclusions. They connect BR performance to the learned leaf structure via a proximity order framework and corroborate this with Manhattan-distance neighborhood analyses, identifying ratio-based neighborhood statistics that correlate with $BR^{\text{opt}}$. The work also documents practical implications, such as sublinear training-time growth with increasing $BR$ and the need for RF implementations to support $BR > 1.0$, offering a path toward predictive BR selection based on dataset characteristics.

Abstract

Random forests (RFs) utilize bootstrap sampling to generate individual training sets for each component tree by sampling with replacement, with the sample size typically equal to that of the original training set ($N$). Previous research indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is referred to as the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1.0) has been explored only to a limited extent and has generally been considered ineffective. In this paper, we revisit this setup using 36 diverse datasets, evaluating BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that higher BR values can lead to statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1.0). Furthermore, we analyze how BR affects the leaf structure of decision trees within the RF and investigate factors influencing the optimal BR. Our results indicate that the optimal BR is primarily determined by the characteristics of the data set rather than the RF hyperparameters.
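To make the definition of the bootstrap rate concrete, the following minimal sketch draws a bootstrap sample of size BR $\cdot N$ with replacement and reports how many distinct training instances it contains. It is an illustration only, not the authors' code: the function names `bootstrap_indices` and `expected_unique_fraction` are hypothetical, and the rounding of BR $\cdot N$ to an integer is an assumption. The expected share of distinct instances, $1 - (1 - 1/N)^{BR \cdot N} \approx 1 - e^{-BR}$, follows directly from sampling with replacement.

```python
import random


def bootstrap_indices(n, br, rng):
    """Draw round(br * n) indices from range(n) with replacement.

    Hypothetical helper: br is the bootstrap rate, n the training-set size.
    """
    size = max(1, round(br * n))
    return [rng.randrange(n) for _ in range(size)]


def expected_unique_fraction(n, br):
    """Expected share of distinct training instances in one bootstrap sample."""
    size = max(1, round(br * n))
    return 1.0 - (1.0 - 1.0 / n) ** size


rng = random.Random(0)  # fixed seed so the draw is repeatable
n = 1000
for br in (0.2, 1.0, 2.0, 5.0):
    sample = bootstrap_indices(n, br, rng)
    observed = len(set(sample)) / n
    expected = expected_unique_fraction(n, br)
    print(f"BR={br}: size={len(sample)}, "
          f"unique observed={observed:.3f}, expected={expected:.3f}")
```

Under these assumptions, a sample drawn with BR $=$ 1.0 is expected to cover about 63% of the distinct training instances, and the coverage rises toward 100% as BR grows, so each tree sees more of the original data and correspondingly more repeated observations.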

Paper Structure (18 sections, 1 equation, 7 figures, 40 tables)

Introduction
Related Literature
Hyperparameter Optimization
Random Forest Hyperparameter Tuning
Experiment Configuration
Results
High vs. Standard BR Values
Distribution of Optimal BR Hyperparameter Values
Analysis of BR Curve Shapes
Time Performance Across BR Values
Understanding the Optimal BR
Proximity Order and Its Impact on the Leaf Structure
Empirical Analysis of Neighborhood Structure Using Manhattan Distance
Impact of Closer and More Distant Neighborhoods on Optimal BR
Limitations of Neighborhood-Based Analysis
...and 3 more sections

Figures (7)

Figure 1: Distribution of the winning BR across all RF configurations (top left) and among individual RF parameterizations.
Figure 1: Characteristics of BR curves for datasets not shown in Figure \ref{fig:BR_characteristics_main}.
Figure 2: BR curves for selected datasets.
Figure 3: The relationship between the training time of different RF configurations and the size of BR, averaged across all datasets. For each dataset, the times were normalized so that the training time of RF(nt_500) with BR $=$ 1.0 equals 1.
Figure 4: An example illustrating how even small differences in the data can significantly affect the optimal BR value. Both figures (a) and (b) show synthetically generated data using scikit-learn's $make\_classification$ method with the following parameters: $n\_samples = 300$, $n\_features = 2$, $n\_classes = 2$, $n\_clusters\_per\_class = 1$, and $random\_state = 1$. The only difference between them is the value of the $class\_sep$ parameter, which controls class separation. In (a), $class\_sep$ is set to 1.95, while in (b), it is set to 2.0. As a result of this slight difference, the optimal BR in (a) equals 5.0, while in (b), it amounts to 0.2. All other parameters of the $make\_classification$ method remain at their default values.
...and 2 more figures
