Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance
Stanisław Kaźmierczak, Jacek Mańdziuk
TL;DR
This study revisits bootstrap sampling in random forests (RFs) by exploring bootstrap rates (BR) beyond the conventional $BR=1.0$, demonstrating that higher BR values frequently improve classification accuracy on diverse datasets. Through extensive experiments on 36 datasets with 18 RF configurations and BR values from $0.2$ to $5.0$, the authors show that the optimal BR is largely dataset-dependent and often exceeds $1.0$, challenging prior conclusions. They connect BR performance to the learned leaf structure via a proximity order framework and corroborate this with Manhattan-distance neighborhood analyses, identifying ratio-based neighborhood statistics that correlate with $BR^{\text{opt}}$. The work also documents practical implications, such as sublinear training-time growth with increasing BR and the need for RF implementations to support $BR > 1.0$, offering a path toward predictive BR selection based on dataset characteristics.
Abstract
Random forests (RFs) utilize bootstrap sampling to generate individual training sets for each component tree by sampling with replacement, with the sample size typically equal to that of the original training set ($N$). Previous research indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is referred to as the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1.0) has been explored only to a limited extent and has generally been considered ineffective. In this paper, we revisit this setup using 36 diverse datasets, evaluating BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that higher BR values can lead to statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1.0). Furthermore, we analyze how BR affects the leaf structure of decision trees within the RF and investigate factors influencing the optimal BR. Our results indicate that the optimal BR is primarily determined by the characteristics of the dataset rather than the RF hyperparameters.
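To make the definition of the bootstrap rate concrete, the following minimal sketch draws a bootstrap sample of size $\text{round}(BR \cdot N)$ with replacement and reports what fraction of the $N$ training instances appears in it at least once. It is an illustration only, not the authors' implementation: the function names `bootstrap_indices` and `expected_unique_fraction`, the choice $N = 1000$, and the seed are assumptions introduced here. The expected fraction follows from the standard result that a given instance is missed by all $m$ draws with probability $(1 - 1/N)^m$.

```python
import numpy as np


def bootstrap_indices(n_train, bootstrap_rate, rng):
    """Draw round(bootstrap_rate * n_train) row indices with replacement.

    Hypothetical helper: bootstrap_rate may exceed 1.0, in which case the
    sample contains more rows than the original training set.
    """
    if bootstrap_rate <= 0:
        raise ValueError("bootstrap_rate must be positive")
    sample_size = int(round(bootstrap_rate * n_train))
    return rng.integers(0, n_train, size=sample_size)


def expected_unique_fraction(n_train, bootstrap_rate):
    """Expected share of distinct training instances in one bootstrap sample."""
    sample_size = int(round(bootstrap_rate * n_train))
    return 1.0 - (1.0 - 1.0 / n_train) ** sample_size


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # seed chosen arbitrarily for the sketch
    n_train = 1000                  # assumed training-set size
    for br in (0.5, 1.0, 2.0, 5.0):
        idx = bootstrap_indices(n_train, br, rng)
        observed = np.unique(idx).size / n_train
        expected = expected_unique_fraction(n_train, br)
        print(f"BR={br:.1f}  size={idx.size}  "
              f"unique observed={observed:.3f}  expected={expected:.3f}")
```

Under these assumptions, raising BR above 1.0 increases the share of distinct training instances each tree sees (roughly $1 - e^{-BR}$ for large $N$) while also increasing how often individual instances are repeated. Each tree would then be fit on the rows selected by the returned indices.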
