DiNo and RanBu: Lightweight Predictions from Shallow Random Forests

Tiago Mendonça dos Santos, Rafael Izbicki, Luís Gustavo Esteves

TL;DR

This work addresses the latency and memory bottlenecks of large Random Forests on tabular data by introducing two shallow-forest kernels, DiNo and RanBu, that convert a fixed, depth-limited forest into distance-weighted predictors using MRCA-based and Breiman proximities, respectively. Predictions are obtained post-training via a Gaussian-style kernel weighting with a single bandwidth parameter $h$, enabling substantial speedups without retraining. Empirical results across synthetic and 25 real-world datasets show RanBu often matches or surpasses full-depth RFs in accuracy while drastically reducing runtime (up to 95% in some settings), with DiNo offering stable gains in low-noise regimes; both extend naturally to conditional quantiles. The methods are open-source, mesh well with existing RF tooling, and preserve interpretability rooted in the tree structure, making them attractive for latency-sensitive deployments and similarity-based tasks such as clustering or anomaly detection.

Abstract

Random Forest ensembles are a strong baseline for tabular prediction tasks, but their reliance on hundreds of deep trees often results in high inference latency and memory demands, limiting deployment in latency-sensitive or resource-constrained environments. We introduce DiNo (Distance with Nodes) and RanBu (Random Bushes), two shallow-forest methods that convert a small set of depth-limited trees into efficient, distance-weighted predictors. DiNo measures cophenetic distances via the most recent common ancestor of observation pairs, while RanBu applies kernel smoothing to Breiman's classical proximity measure. Both approaches operate entirely after forest training: no additional trees are grown, and tuning of the single bandwidth parameter $h$ requires only lightweight matrix-vector operations. Across three synthetic benchmarks and 25 public datasets, RanBu matches or exceeds the accuracy of full-depth random forests (particularly in high-noise settings) while reducing training plus inference time by up to 95%. DiNo achieves the best bias-variance trade-off in low-noise regimes at a modest computational cost. Both methods extend directly to quantile regression, maintaining accuracy with substantial speed gains. The implementation is available as an open-source R/C++ package at https://github.com/tiagomendonca/dirf. We focus on structured tabular random samples (i.i.d.), leaving extensions to other modalities for future work.
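
The idea of smoothing Breiman's proximity with a single bandwidth $h$ can be illustrated with a minimal Python sketch. This is not the authors' R/C++ implementation. It assumes leaf indices per tree are already available from some shallow forest, takes the distance to be one minus the proximity, and uses the weight $\exp(-(d/h)^2)$ as one plausible "Gaussian-style" kernel; the exact kernel form and the function names `breiman_proximity` and `kernel_predict` are assumptions made for illustration.

```python
import numpy as np

def breiman_proximity(test_leaves, train_leaves):
    """Fraction of trees in which each test/train pair falls in the same leaf.

    test_leaves:  (m, B) integer array of leaf indices, one column per tree.
    train_leaves: (n, B) integer array of leaf indices for the training set.
    Returns an (m, n) proximity matrix with entries in [0, 1].
    """
    same_leaf = test_leaves[:, None, :] == train_leaves[None, :, :]
    return same_leaf.mean(axis=2)

def kernel_predict(proximity, y_train, h):
    """Kernel-weighted average of training responses.

    Assumption: distance = 1 - proximity and weight = exp(-(distance / h)**2).
    Very small h can make every weight underflow to zero; no guard is added here.
    """
    distance = 1.0 - proximity
    weights = np.exp(-(distance / h) ** 2)
    # One matrix-vector product per bandwidth value, so trying several h is cheap.
    return weights @ y_train / weights.sum(axis=1)

# Toy leaf assignments for 3 training points and 1 test point over 2 trees.
train_leaves = np.array([[0, 0], [0, 1], [1, 1]])
y_train = np.array([1.0, 2.0, 3.0])
test_leaves = np.array([[0, 0]])

prox = breiman_proximity(test_leaves, train_leaves)
pred_narrow = kernel_predict(prox, y_train, h=0.1)   # dominated by the closest point
pred_wide = kernel_predict(prox, y_train, h=1e6)     # approaches the global mean
```

Because the proximity matrix is computed once, changing $h$ only recomputes the weights and the weighted average, which is consistent with the abstract's remark that bandwidth tuning needs only matrix-vector operations.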

Paper Structure

This paper contains 30 sections, 30 equations, 5 figures, and 13 tables.

Figures (5)

  • Figure 1: Average time (in seconds) required to train each model and generate a prediction for a single test observation, as a function of the training set size. The experiment is based on the friedman simulation setting with 4 informative predictors and 100 noise variables. RanBu and DiNo remain efficient even as sample size increases, while full-depth Random Forests (R.F.) and GRF exhibit substantially higher computational costs. The right-hand panel uses a logarithmic scale on the $y$-axis to better visualize differences across methods. Results are averaged over 50 replications.
  • Figure 2: Ratio of pinball loss to Random Forest across quantiles for DiNo, RanBu, GRF, and Reduced Random Forest. Bandwidth for DiNo and RanBu fixed at $h=0.20$. Across most datasets, DiNo (red) and RanBu (blue) achieve ratios close to or below one, indicating comparable or superior accuracy to the full R.F. baseline. GRF (black) and Reduced R.F. (purple) often yield higher losses, especially at intermediate quantiles. The robustness of DiNo and RanBu across diverse distributions highlights their suitability for quantile regression tasks.
  • Figure 3: Ratio of pinball loss of DiNo to Random Forest across quantiles for different bandwidth values $h$. Performance is sensitive to bandwidth choice: very small $h$ inflates loss at the distribution tails, while very large $h$ tends to degrade accuracy at central quantiles. Intermediate values (e.g., $h=0.15$–$0.20$) generally yield the most stable and competitive performance across datasets.
  • Figure 4: Ratio of pinball loss of RanBu to Random Forest across quantiles for different bandwidth values $h$. Curves are generally closer to one, indicating greater robustness to bandwidth choice and more consistent performance across quantiles. Unlike DiNo, RanBu is less affected by extreme values of $h$, with stable behavior across datasets and competitive results even for non-optimal bandwidths.
  • Figure 5: Multidimensional scaling (MDS) plots and pairwise distance distributions for a subset of the Ames Housing data. Compared with Breiman proximities from a full or reduced random forest, DiNo generates embeddings with clearer structure and a more dispersed distance distribution. This greater spread suggests stronger discriminatory ability, making the distance attractive for clustering and other similarity-based tasks beyond prediction.
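
Figures 2 to 4 report the pinball loss of each method divided by that of the full Random Forest, so values below one favor the method. A minimal Python sketch of that ratio follows, using the standard pinball (quantile) loss definition; the function names and toy numbers are illustrative assumptions and are not taken from the paper or its package.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss at quantile level tau (0 < tau < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction is penalized by tau, over-prediction by (1 - tau).
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

def loss_ratio(y_true, pred_method, pred_baseline, tau):
    """Pinball loss of a method relative to a baseline; below 1 favors the method."""
    return pinball_loss(y_true, pred_method, tau) / pinball_loss(y_true, pred_baseline, tau)

# Toy values: the method's quantile predictions are closer than the baseline's.
y = [1.0, 2.0]
ratio = loss_ratio(y, pred_method=[1.0, 1.5], pred_baseline=[0.0, 1.0], tau=0.5)
```

Computing this ratio over a grid of quantile levels gives one curve per method, which is the layout the figure captions describe.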

Theorems & Definitions (1)

  • proof