Consistency of Oblique Decision Tree and its Boosting and Random Forest

Haoran Zhan, Yu Liu, Yingcun Xia

TL;DR

The paper establishes rigorous consistency results for Oblique Decision Trees (ODT) and their ensembles, showing that ODT is consistent for general $L^2$-integrable regression functions, and that ODT-based Random Forests (ODRF) retain consistency under both partially and fully grown-tree regimes. It extends to an ensemble of gradient boosting trees (ODBT) built via orthogonal matching pursuit, proving consistency and, under certain function classes, fast rates. The authors also introduce two feature-bagging schemes, refine ODRF implementations, and provide extensive real-data experiments demonstrating improvements over standard RF and related forests. Collectively, the work unifies oblique partitions with ensemble methods under solid theoretical guarantees and practical gains, bridging tree-based methods with neural-network-like representations to achieve strong performance.

Abstract

Classification and Regression Tree (CART), Random Forest (RF) and Gradient Boosting Tree (GBT) are probably the most popular set of statistical learning methods. However, their statistical consistency can only be proved under very restrictive assumptions on the underlying regression function. As an extension to standard CART, the oblique decision tree (ODT), which uses linear combinations of predictors as partitioning variables, has received much attention. ODT tends to perform numerically better than CART and requires fewer partitions. In this paper, we show that ODT is consistent for very general regression functions as long as they are $L^2$ integrable. Then, we prove the consistency of the ODT-based random forest (ODRF), whether fully grown or not. Finally, we propose an ensemble of GBT for regression by borrowing the technique of orthogonal matching pursuit and study its consistency under very mild conditions on the tree structure. After refining existing computer packages according to the established theory, extensive experiments on real data sets show that both our ensemble boosting trees and ODRF have noticeable overall improvements over RF and other forests.

Paper Structure (30 sections, 20 theorems, 231 equations, 4 figures, 5 tables, 3 algorithms)

Key Result

Theorem 3.1

Assume ${\mathbf E}(e^{c\cdot Y^2})<\infty$ for some $c>0$, and that $m(X)$ is $L^2$ integrable. If $t_n\to\infty$ and $t_n =o\left(\frac{n}{\ln^4{n}}\right)$, we have
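
The conclusion of the statement is not reproduced here. Given the theorem's title (consistency of ODT), it is presumably an $L^2$-consistency statement. The following is a sketch of the usual form, in assumed notation: $\hat m_n$ stands for the ODT estimate with $t_n$ leaves built from ${\mathcal D}_n$, and $\mu$ for the distribution of $X$. Neither symbol is confirmed by the text above.

```latex
% Assumed form of the conclusion (hypothetical notation \hat m_n and \mu):
\mathbf{E}\int \bigl|\hat m_n(x)-m(x)\bigr|^{2}\,d\mu(x)\;\longrightarrow\;0
\quad\text{as } n\to\infty .

% An admissible choice of the number of leaves:
t_n=\lfloor\sqrt{n}\rfloor
\;\Longrightarrow\;
t_n\to\infty
\quad\text{and}\quad
\frac{t_n\ln^{4} n}{n}\le\frac{\ln^{4} n}{\sqrt{n}}\to 0 .
```

The second display only checks the growth condition: any sequence of leaf counts that diverges more slowly than $n/\ln^4 n$ qualifies, and $\lfloor\sqrt{n}\rfloor$ is one convenient example.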

Figures (4)

  • Figure 1: An example of $T_{\mathcal{D}_n,6,3}$, which has three layers (${\cal L}=3$) and $6$ leaves. Specifically, the root node $\mathbb{A}_0^1$ is in layer 0, nodes $\mathbb{A}_1^1$ and $\mathbb{A}_1^2$ are in layer 1, nodes $\mathbb{A}_2^1, \mathbb{A}_2^2, \mathbb{A}_2^3, \mathbb{A}_2^4$ are in layer $\ell = 2$, and leaves $\mathbb{A}_3^1, \mathbb{A}_3^2, \mathbb{A}_3^3, \mathbb{A}_3^4, \mathbb{A}_3^5, \mathbb{A}_3^6$ are in layer 3. Note that $\mathbb{A}_2^2$ contains only one data point and cannot be divided further, which implies that $\mathbb{A}_2^2=\mathbb{A}_3^3$. Likewise, no matter how many data points $\mathbb{A}_2^4$ contains, we have $\mathbb{A}_2^4=\mathbb{A}_3^6$ because $t_n$ is preset to 6. Finally, given data ${\mathcal{D}}_n$, the estimators are $m_{n,2}(x)=\sum_{j=1}^4{\mathbb{I}(x\in \mathbb{A}_2^j)\cdot \bar{Y}_{\mathbb{A}_2^j}}$ and $m_{n,3}(x)=\sum_{j=1}^6{\mathbb{I}(x\in \mathbb{A}_3^j)\cdot \bar{Y}_{\mathbb{A}^j_3}}$.
  • Figure S.1: An example of situation (b). Here, the cut $\bm{d}_1=(\theta_1,s_1)$ divides $[0,1]^p$ into a left part $A_{L,1}= I\cup IV$ and a right part $A_{R,1}= II\cup III$, while the cut $\bm{d}_1'=(\theta_2,s_2)$ divides $[0,1]^p$ into a different left part $A_{L,2}= I\cup III$ and right part $A_{R,2}= II\cup IV$.
  • Figure S.2: In this case, $A(x,(d_1,d_2,d_3))= \triangle ABC$ and $A(x,(d_1,d_2,d_3'))= \triangle ADE$. The cut $d_4'$ divides $\triangle ABC$ into two daughters, where $\triangle ABC_L=\Box AFGC$ and $\triangle ABC_R=\triangle GFB$. Meanwhile, the cut $d_4'$ divides $\triangle ADE$ into two daughters, where $\triangle ADE_L=\Box ADHO$ and $\triangle ADE_R=\triangle HOE$.
  • Figure S.3: This ODT has two layers and three leaves denoted by $\mathbb{A}_2^1, \mathbb{A}_2^2, \mathbb{A}_2^3$. Note that $\mathbb{A}_1^1$ is not partitioned anymore and thus $\mathbb{A}_1^1=\mathbb{A}_2^1$. Meanwhile, it can be seen that $\mathbb{A}_2^1=\{x:\theta_1^Tx\le s_1\}$, $\mathbb{A}_2^2=\{x:\theta_1^Tx> s_1\}\cap\{x:\theta_2^Tx\le s_2\}$ and $\mathbb{A}_2^3=\{x:\theta_1^Tx> s_1\}\cap\{x:\theta_2^Tx> s_2\}$.
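
The leaf sets in Figure S.3 and the leaf-average estimators in Figure 1 can be illustrated with a minimal Python sketch. It assumes the two oblique cuts $(\theta_1,s_1)$ and $(\theta_2,s_2)$ are already given. The function names (`leaf_index`, `fit_leaf_means`, `predict`) and all numeric values are hypothetical, chosen only for illustration, and the sketch does not show how the paper selects the cuts.

```python
import numpy as np

def leaf_index(x, theta1, s1, theta2, s2):
    """Return 1, 2 or 3 for the leaves A_2^1, A_2^2, A_2^3 of Figure S.3."""
    if float(np.dot(theta1, x)) <= s1:
        return 1                      # {theta1^T x <= s1}
    # Remaining points satisfy theta1^T x > s1 and are split by the second cut.
    return 2 if float(np.dot(theta2, x)) <= s2 else 3

def fit_leaf_means(X, y, theta1, s1, theta2, s2):
    """Assign each sample to a leaf and average the responses per leaf."""
    leaves = np.array([leaf_index(x, theta1, s1, theta2, s2) for x in X])
    means = {j: float(y[leaves == j].mean())
             for j in (1, 2, 3) if np.any(leaves == j)}
    return leaves, means

def predict(x, means, theta1, s1, theta2, s2):
    """Piecewise-constant estimate: the mean response of the leaf containing x."""
    return means[leaf_index(x, theta1, s1, theta2, s2)]

# Hypothetical cuts in two dimensions (illustrative values only).
theta1, s1 = np.array([1.0, 1.0]), 1.0    # first cut:  x1 + x2 <= 1
theta2, s2 = np.array([1.0, -1.0]), 0.0   # second cut: x1 - x2 <= 0

# Toy data, not taken from the paper.
X = np.array([[0.2, 0.3], [0.1, 0.1], [0.4, 0.9], [0.9, 0.8], [1.0, 0.5]])
y = np.array([1.0, 3.0, 10.0, 20.0, 30.0])

leaves, means = fit_leaf_means(X, y, theta1, s1, theta2, s2)
estimate = predict(np.array([0.3, 0.3]), means, theta1, s1, theta2, s2)
```

Each cut thresholds a linear combination $\theta^T x$ rather than a single coordinate, which is what distinguishes an oblique split from the axis-aligned splits of CART. The prediction is the indicator-weighted sum of leaf averages, as in the formula for $m_{n,\ell}(x)$.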

Theorems & Definitions (31)

  • Theorem 3.1: Consistency of ODT before pruning
  • Remark 1
  • Definition 3.2: blumer1989learnability
  • Definition 3.3: gyorfi2006distribution
  • Lemma 3.4: bagirov2009estimation
  • Lemma 3.5
  • proof
  • Definition 3.6
  • Lemma 3.7: gyorfi2006distribution
  • Lemma 3.8
  • ...and 21 more