On the Statistical Optimality of Optimal Decision Trees

Zineng Xu; Subhroshekhar Ghosh; Yan Shuo Tan

TL;DR

This work develops a comprehensive statistical theory for ERM trees under random design in both high-dimensional regression and classification and derives minimax optimal rates over a novel function class: the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space.

Abstract

While globally optimal empirical risk minimization (ERM) decision trees have become computationally feasible and empirically successful, rigorous theoretical guarantees for their statistical performance remain limited. In this work, we develop a comprehensive statistical theory for ERM trees under random design in both high-dimensional regression and classification. We first establish sharp oracle inequalities that bound the excess risk of the ERM estimator relative to the best possible approximation achievable by any tree with at most $L$ leaves, thereby characterizing the interpretability-accuracy trade-off. We derive these results using a novel uniform concentration framework based on empirically localized Rademacher complexity. Furthermore, we derive minimax optimal rates over a novel function class: the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space. This space explicitly captures three key structural features encountered in practice: sparsity, anisotropic smoothness, and spatial heterogeneity. While our main results are established under sub-Gaussianity, we also provide robust guarantees that hold under heavy-tailed noise settings. Together, these findings provide a principled foundation for the optimality of ERM trees and introduce empirical process tools broadly applicable to other highly adaptive, data-driven procedures.
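To make the object of study concrete, the following is a minimal sketch of a globally optimal (ERM) regression tree on a toy sample. It assumes squared-error loss, leaf-wise mean predictions, and axis-aligned splits at observed coordinate values; these choices, and the helper name `erm_tree_sse`, are illustrative assumptions rather than the paper's exact definitions. The search is exhaustive, so it is only practical for very small samples.

```python
from functools import lru_cache


def erm_tree_sse(X, y, L):
    """Smallest training sum of squared errors over all axis-aligned trees
    with at most L leaves (hypothetical helper, brute-force search)."""
    d = len(X[0])

    def sse(cell):
        # Error of the best constant prediction (the mean) on this cell.
        mean = sum(y[i] for i in cell) / len(cell)
        return sum((y[i] - mean) ** 2 for i in cell)

    @lru_cache(maxsize=None)
    def best(cell, k):
        # Minimal SSE on `cell` using at most k leaves.
        result = sse(cell)
        if k == 1:
            return result
        for j in range(d):
            values = sorted({X[i][j] for i in cell})
            # Thresholds at observed values; the largest would leave one side empty.
            for t in values[:-1]:
                left = frozenset(i for i in cell if X[i][j] <= t)
                right = cell - left
                # Distribute the leaf budget between the two children.
                for k_left in range(1, k):
                    candidate = best(left, k_left) + best(right, k - k_left)
                    result = min(result, candidate)
        return result

    return best(frozenset(range(len(X))), L)


X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0, 0, 1, 1]
print(erm_tree_sse(X, y, 1))  # 1.0: a single leaf predicts the mean 0.5
print(erm_tree_sse(X, y, 2))  # 0.0: one split between x=1 and x=2 fits exactly
```

Increasing the leaf budget `L` can only lower the training error, which is the approximation side of the interpretability-accuracy trade-off described above; the oracle inequalities concern how the excess risk on new data behaves as `L` grows.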

Paper Structure (41 sections, 32 theorems, 215 equations)

Introduction
Fundamentals of tree-based algorithms
Problem formulation
Notation
Vectors, random variables, indexing.
Norms and inner products.
Constants and asymptotic notation.
Cells, volumes and side lengths.
Partitions
Decision trees
Oracle inequalities
Oracle inequalities for regression
Oracle inequalities for classification
Piecewise sparse heterogeneous anisotropic Besov spaces
Approximation bounds over PSHAB spaces
...and 26 more sections

Key Result

Lemma 2.1

The number of valid tree-based partitions with at most $L$ leaves satisfies $|\mathcal{P}_{L}^\mathcal{X}| \leq (dn)^L$.
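The bound can be checked numerically on a small sample. The sketch below assumes that a "valid" partition is one obtained by recursive axis-aligned splits at observed coordinate values with both children non-empty, and that two trees inducing the same grouping of the sample count as one partition. The excerpt does not state the paper's definition of $\mathcal{P}_{L}^\mathcal{X}$, so this reading, along with the helper names `cell_splits` and `partitions_up_to`, is an assumption.

```python
def cell_splits(cell, X):
    """All distinct ways to split a cell (a frozenset of sample indices)
    into two non-empty children with one axis-aligned threshold."""
    out = set()
    d = len(X[0])
    for j in range(d):
        values = sorted({X[i][j] for i in cell})
        for t in values[:-1]:  # the largest value would leave the right child empty
            left = frozenset(i for i in cell if X[i][j] <= t)
            out.add((left, cell - left))
    return out


def partitions_up_to(X, L):
    """Set of all sample partitions reachable with at most L leaves."""
    root = frozenset([frozenset(range(len(X)))])
    seen = {root}
    frontier = {root}  # partitions found at the current leaf count
    for _ in range(L - 1):
        next_frontier = set()
        for part in frontier:
            for cell in part:
                for left, right in cell_splits(cell, X):
                    child = (part - {cell}) | {left, right}
                    if child not in seen:
                        seen.add(child)
                        next_frontier.add(child)
        frontier = next_frontier
    return seen


X = [(0.1, 0.9), (0.4, 0.2), (0.6, 0.7), (0.8, 0.3)]
n, d, L = len(X), len(X[0]), 3
count = len(partitions_up_to(X, L))
print(count, "<=", (d * n) ** L)
assert count <= (d * n) ** L
```

Under this reading the bound is intuitive: any single leaf with $m$ sample points admits at most $d(m-1)$ distinct splits, so a partition has at most $dn$ one-step refinements, and each additional leaf multiplies the count by at most $dn$.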

Theorems & Definitions (112)

Lemma 2.1
proof
Definition 2.2: ERM regression tree estimators
Remark 2.3: Notation
Definition 2.4: ERM classification tree estimators
Remark 2.5
Remark 2.6
Theorem 3.1: Oracle inequalities for ERM regression trees
Remark 3.2: Bias-variance trade-off
Remark 3.3: Interpretability-accuracy tradeoff
...and 102 more
