Riemann-Lebesgue Forest for Regression
Tian Qin, Wei-Min Huang
TL;DR
The paper introduces Riemann-Lebesgue Forest (RLF), a regression ensemble that blends Lebesgue-type splitting of the response with traditional Riemann-type splits on predictors via a Bernoulli switch to regularize the use of Lebesgue cuts. It formalizes the Riemann-Lebesgue Tree (RLT) and constructs a forest using subsampling and incomplete $U$-statistics, with a local RF model at leaves to estimate conditional means. Theoretical contributions include a variance-reduction guarantee showing Lebesgue cuts yield at least as much variance reduction as CART cuts (Theorem 1) and Berry-Esseen-type bounds for the asymptotic normality of RLF in small-sample regimes (Theorem 3), along with a complexity analysis. Empirically, RLF demonstrates competitive performance against Random Forest, especially in sparse/high-noise settings, and shows improved or comparable MSE across a range of real and synthetic datasets, with tunable $\tilde p$ offering practical gains. This work provides a new base learner that leverages response-space information to improve regression ensembles and suggests directions for faster local modeling and boosting-type extensions.
Abstract
We propose a novel ensemble method called Riemann-Lebesgue Forest (RLF) for regression. The core idea of RLF is to mimic the way a measurable function can be approximated by partitioning its range into a few intervals. With this idea in mind, we develop a new tree learner named Riemann-Lebesgue Tree (RLT), which may perform Lebesgue-type cutting, i.e., splitting a node on the response $Y$, at certain non-terminal nodes. We show that the optimal Lebesgue-type cutting results in larger variance reduction in the response $Y$ than ordinary CART \cite{Breiman1984ClassificationAR} cutting (an analogue of Riemann partition). This property is beneficial to the ensemble part of RLF. We also generalize the asymptotic normality of RLF under different parameter settings. Two one-dimensional examples are provided to illustrate the flexibility of RLF. The competitive performance of RLF against the original random forest \cite{Breiman2001RandomF} is demonstrated by experiments on simulated data and real-world datasets.
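The variance-reduction claim can be checked numerically on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a single predictor, a synthetic response $Y = \sin(4X) + \varepsilon$, and an exhaustive search over thresholds. The helper names `sse` and `best_reduction` are hypothetical. A CART-style cut thresholds the predictor $X$, while a Lebesgue-type cut thresholds the response $Y$ itself; both are scored by the drop in the sum of squared errors of $Y$.

```python
import math
import random


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_reduction(keys, y):
    """Largest drop in SSE of y over all threshold splits on `keys`.

    keys = predictor values gives a CART-style (Riemann-type) cut;
    keys = y itself gives a Lebesgue-type cut on the response.
    """
    order = sorted(range(len(y)), key=lambda i: keys[i])
    k_sorted = [keys[i] for i in order]
    y_sorted = [y[i] for i in order]
    total = sse(y_sorted)
    best = 0.0
    for cut in range(1, len(y_sorted)):
        if k_sorted[cut] == k_sorted[cut - 1]:
            continue  # no threshold can separate tied key values
        reduction = total - sse(y_sorted[:cut]) - sse(y_sorted[cut:])
        best = max(best, reduction)
    return best


# Synthetic data (an assumption made for illustration only).
rng = random.Random(0)
x = [rng.random() for _ in range(200)]
y = [math.sin(4.0 * xi) + rng.gauss(0.0, 0.3) for xi in x]

cart_gain = best_reduction(x, y)      # split on the predictor
lebesgue_gain = best_reduction(y, y)  # split on the response

print(f"CART-style cut SSE reduction:    {cart_gain:.4f}")
print(f"Lebesgue-type cut SSE reduction: {lebesgue_gain:.4f}")
```

In this sketch the Lebesgue-type reduction is never smaller than the CART-style one, because any threshold on $X$ induces a two-group partition of the samples, and the best two-group partition of a set of real numbers is always contiguous in sorted order of $Y$. The sketch leaves out everything else in RLF, including the random switch between cut types, subsampling, and the local model at the leaves.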
