On the Statistical Optimality of Optimal Decision Trees

Zineng Xu, Subhroshekhar Ghosh, Yan Shuo Tan

TL;DR

This work develops a comprehensive statistical theory for ERM trees under random design in both high-dimensional regression and classification, and derives minimax optimal rates over a novel function class: the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space.

Abstract

While globally optimal empirical risk minimization (ERM) decision trees have become computationally feasible and empirically successful, rigorous theoretical guarantees for their statistical performance remain limited. In this work, we develop a comprehensive statistical theory for ERM trees under random design in both high-dimensional regression and classification. We first establish sharp oracle inequalities that bound the excess risk of the ERM estimator relative to the best possible approximation achievable by any tree with at most $L$ leaves, thereby characterizing the interpretability-accuracy trade-off. We derive these results using a novel uniform concentration framework based on empirically localized Rademacher complexity. Furthermore, we derive minimax optimal rates over a novel function class: the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space. This space explicitly captures three key structural features encountered in practice: sparsity, anisotropic smoothness, and spatial heterogeneity. While our main results are established under sub-Gaussianity, we also provide robust guarantees that hold under heavy-tailed noise settings. Together, these findings provide a principled foundation for the optimality of ERM trees and introduce empirical process tools broadly applicable to other highly adaptive, data-driven procedures.

Paper Structure

This paper contains 41 sections, 32 theorems, 215 equations.

Key Result

Lemma 2.1

The number of valid tree-based partitions with at most $L$ leaves satisfies $|\mathcal{P}_{L}^\mathcal{X}| \leq (dn)^L$.
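
To make the counting bound concrete, the following is a minimal brute-force sketch that enumerates tree-based partitions on a tiny sample and compares the count with $(dn)^L$. It is not the paper's proof or code. It assumes one plausible reading of "valid": every split is axis-aligned, placed at an observed coordinate value, and leaves both children non-empty, and two trees count as the same partition when they group the $n$ sample points identically. The names `splits_of_cell` and `tree_partitions` are hypothetical helpers introduced for illustration.

```python
def splits_of_cell(X, cell):
    """Yield (left, right) axis-aligned splits of `cell` with both sides non-empty.

    X is a list of n points (tuples of length d); `cell` is a frozenset of
    sample indices. Thresholds are taken at observed coordinate values
    (an assumption about what "valid" means).
    """
    d = len(X[0])
    seen = set()
    for j in range(d):
        values = sorted({X[i][j] for i in cell})
        for t in values[:-1]:  # dropping the largest value keeps the right side non-empty
            left = frozenset(i for i in cell if X[i][j] <= t)
            if left not in seen:
                seen.add(left)
                yield left, cell - left


def tree_partitions(X, max_leaves):
    """Return all distinct sample partitions reachable with at most `max_leaves` leaves."""
    root = frozenset(range(len(X)))
    start = frozenset([root])
    found = {start}      # every partition discovered so far
    frontier = {start}   # partitions with exactly k leaves in round k
    for _ in range(max_leaves - 1):
        next_frontier = set()
        for part in frontier:
            for cell in part:
                for left, right in splits_of_cell(X, cell):
                    new_part = (part - {cell}) | {left, right}
                    if new_part not in found:
                        found.add(new_part)
                        next_frontier.add(new_part)
        frontier = next_frontier
    return found


# Small illustration: n = 5 points in d = 2 dimensions, at most L = 3 leaves.
X = [(0.1, 0.9), (0.4, 0.2), (0.5, 0.7), (0.8, 0.3), (0.9, 0.6)]
n, d, L = len(X), len(X[0]), 3
count = len(tree_partitions(X, L))
print(count, "<=", (d * n) ** L)
```

Under these assumptions the bound is intuitive: growing a tree one split at a time, each step has fewer than $dn$ admissible (coordinate, threshold) choices summed over the current leaves, so trees with at most $L$ leaves number at most $\sum_{l=1}^{L} (dn)^{l-1} \leq (dn)^L$.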

Theorems & Definitions (112)

  • Lemma 2.1
  • proof
  • Definition 2.2: ERM regression tree estimators
  • Remark 2.3: Notation
  • Definition 2.4: ERM classification tree estimators
  • Remark 2.5
  • Remark 2.6
  • Theorem 3.1: Oracle inequalities for ERM regression trees
  • Remark 3.2: Bias-variance trade-off
  • Remark 3.3: Interpretability-accuracy tradeoff
  • ...and 102 more