Riemann-Lebesgue Forest for Regression

Tian Qin; Wei-Min Huang

TL;DR

The paper introduces Riemann-Lebesgue Forest (RLF), a regression ensemble that blends Lebesgue-type splitting of the response with traditional Riemann-type splits on predictors via a Bernoulli switch to regularize the use of Lebesgue cuts. It formalizes the Riemann-Lebesgue Tree (RLT) and constructs a forest using subsampling and incomplete $U$-statistics, with a local RF model at leaves to estimate conditional means. Theoretical contributions include a variance-reduction guarantee showing Lebesgue cuts yield at least as much variance reduction as CART cuts (Theorem 1) and Berry-Esseen-type bounds for the asymptotic normality of RLF in small-sample regimes (Theorem 3), along with a complexity analysis. Empirically, RLF demonstrates competitive performance against Random Forest, especially in sparse/high-noise settings, and shows improved or comparable MSE across a range of real and synthetic datasets, with tunable $\tilde p$ offering practical gains. This work provides a new base learner that leverages response-space information to improve regression ensembles and suggests directions for faster local modeling and boosting-type extensions.
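The subsampling construction mentioned above can be illustrated with a minimal sketch of an incomplete $U$-statistic: a base learner is evaluated on a random selection of size-$k$ subsamples drawn without replacement, and the results are averaged. The names `local_mean_kernel` and `subsampled_ensemble_predict` are hypothetical, and the nearest-neighbour local mean is only a stand-in for a tree prediction, not the paper's Riemann-Lebesgue Tree.

```python
import numpy as np

def local_mean_kernel(sub_X, sub_y, x0, n_neighbors=5):
    """Hypothetical stand-in for one base learner's prediction at x0:
    the mean response of the nearest neighbours inside the subsample."""
    dist = np.abs(sub_X - x0)
    nearest = np.argsort(dist)[:n_neighbors]
    return sub_y[nearest].mean()

def subsampled_ensemble_predict(X, y, x0, k, n_learners, rng):
    """Incomplete U-statistic: average the kernel over n_learners random
    size-k subsamples instead of over all possible size-k subsets."""
    n = X.shape[0]
    preds = np.empty(n_learners)
    for b in range(n_learners):
        idx = rng.choice(n, size=k, replace=False)  # subsample without replacement
        preds[b] = local_mean_kernel(X[idx], y[idx], x0)
    return preds.mean()

# Toy one-dimensional data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=500)
y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=500)

estimate = subsampled_ensemble_predict(X, y, x0=0.25, k=100, n_learners=200, rng=rng)
print(f"ensemble estimate at x0=0.25: {estimate:.3f}")
```

Averaging over a random subset of subsamples rather than all of them is what makes the statistic "incomplete"; it keeps the cost proportional to the number of base learners.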

Abstract

We propose a novel ensemble method called Riemann-Lebesgue Forest (RLF) for regression. The core idea of RLF is to mimic the way a measurable function can be approximated by partitioning its range into a few intervals. With this idea in mind, we develop a new tree learner named Riemann-Lebesgue Tree (RLT), which has a chance to perform Lebesgue-type cutting, i.e., splitting the node on the response $Y$, at certain non-terminal nodes. We show that the optimal Lebesgue-type cutting results in a larger variance reduction in the response $Y$ than ordinary CART \cite{Breiman1984ClassificationAR} cutting (an analogue of the Riemann partition). This property benefits the ensemble part of RLF. We also generalize the asymptotic normality of RLF under different parameter settings. Two one-dimensional examples are provided to illustrate the flexibility of RLF. The competitive performance of RLF against the original random forest \cite{Breiman2001RandomF} is demonstrated by experiments on simulated data and real-world datasets.
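The variance-reduction comparison can be checked numerically on a toy sample. The sketch below is an illustration under assumptions, not the authors' implementation: the helper names, the data-generating model, and the exhaustive threshold scan are all hypothetical. It compares the best CART-style split (a threshold on one predictor) with the best Lebesgue-style split (a threshold on the response itself).

```python
import numpy as np

def variance_reduction(y, left_mask):
    """Drop in (population) variance of y after splitting it by left_mask."""
    n = y.size
    n_left = int(left_mask.sum())
    if n_left == 0 or n_left == n:
        return 0.0
    left, right = y[left_mask], y[~left_mask]
    return y.var() - (n_left / n) * left.var() - ((n - n_left) / n) * right.var()

def best_threshold_split(values, y):
    """Scan every threshold on `values`; return (best reduction, threshold)."""
    best_gain, best_thr = 0.0, None
    for thr in np.unique(values)[:-1]:  # drop the max so both sides are non-empty
        gain = variance_reduction(y, values <= thr)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr

# Toy data (assumed): only the first predictor carries signal.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.sin(3 * X[:, 0]) + 0.5 * rng.normal(size=200)

# Riemann-type (CART) cut: best threshold over all predictors.
riemann_gain = max(best_threshold_split(X[:, j], y)[0] for j in range(X.shape[1]))

# Lebesgue-type cut: best threshold a on the response, {Y <= a} vs {Y > a}.
lebesgue_gain, a_star = best_threshold_split(y, y)

print(f"CART cut reduction:     {riemann_gain:.4f}")
print(f"Lebesgue cut reduction: {lebesgue_gain:.4f}")
```

On any finite sample, the best two-group partition by within-group variance is contiguous in the sorted responses, so the Lebesgue-style reduction computed here cannot fall below the CART-style one.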

Paper Structure (27 sections, 3 theorems, 49 equations, 4 figures, 8 tables, 2 algorithms)

Introduction
Methodology
Preliminary
Riemann-Lebesgue Tree
Riemann-Lebesgue Forest
Theoretical analysis of RLF
Variance reduction of response $Y$ by Lebesgue cuttings
Convergence rate of the asymptotic normality
Complexity analysis
Experiments
Sparse model
Real Data Performance
Tuning of Splitting control probability $\tilde{p}$
Extra experiments of RLF with tuned $\tilde{p}$
Discussion and Limitation
...and 12 more sections

Key Result

Theorem 3.1

Let the regression function be $Y=f(\mathbf{X})+\varepsilon$, where $\mathbf{X}\in \mathbb{R}^{d}$, $Y\in \mathbb{R}$, $f$ is a bounded measurable function, and $\varepsilon$ is the noise term. Under the Riemann cutting and Lebesgue cutting procedures, let $A_{1}^{*}=\{Y>a^{*}\},A_{2}^{*}=\{Y\leq a^{

Figures (4)

Figure 1: Two types of function approximation. (a) "Riemann" type approximation. (b) "Lebesgue" type approximation.
Figure 2: Test MSEs for RLF and RF. (a) Test MSE as a function of the number of trees, (b) Test MSE rate as a function of the number of local trees, (c) Test MSE as a function of the number of noisy variables, (d) Test MSE as a function of the subagging ratio.
Figure 3: (a) Test MSEs for RLF and RF as functions of the control probability $\tilde{p}$. Orange points in (b) and (c) represent test samples generated by the two models. Blue solid lines are the underlying functions; green lines are the predicted curves of the optimal RLFs, and red lines are the predicted curves of the optimal RFs in the two examples.
Figure S4: Test MSE curve as a function of the control probability $\tilde{p}$ in two examples.

Theorems & Definitions (3)

Theorem 3.1
Theorem 3.2
Theorem A.1
