Partition Trees: Conditional Density Estimation over General Outcome Spaces

Felipe Angelim, Alessandro Leite

TL;DR

Partition Trees introduce a unified, nonparametric framework for conditional density estimation over general (mixed-type) outcome spaces by modeling densities as piecewise-constant on data-adaptive partitions defined via Radon–Nikodym derivatives. The method grows trees greedily to minimize the conditional log-loss (equivalently, to maximize the conditional log-likelihood), with an ensemble variant called Partition Forests that averages densities to improve probabilistic predictions. The authors establish $L^1(\nu)$-consistency under standard growth/shrinkage conditions and finite VC assumptions, and show empirically that Partition Trees/Forests deliver competitive probabilistic performance on classification and regression benchmarks relative to CART-based trees, CADET, CDTree, and the Random Forest and XGBoost families, while demonstrating robustness to noise and feature redundancy. The work also provides an efficient, scalable algorithm for joint X- and Y-splits and releases an implementation for practical use and further research.

Abstract

We propose Partition Trees, a tree-based framework for conditional density estimation over general outcome spaces, supporting both continuous and categorical variables within a unified formulation. Our approach models conditional distributions as piecewise-constant densities on data-adaptive partitions and learns trees by directly minimizing conditional negative log-likelihood. This yields a scalable, nonparametric alternative to existing probabilistic trees that does not make parametric assumptions about the target distribution. We further introduce Partition Forests, an ensemble extension obtained by averaging conditional densities. Empirically, we demonstrate improved probabilistic prediction over CART-style trees and competitive or superior performance compared to state-of-the-art probabilistic tree methods and Random Forests, along with robustness to redundant features and heteroscedastic noise.
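The abstract names two ingredients that are easy to illustrate: the conditional negative log-likelihood used as the objective, and the averaging of conditional densities that defines a Partition Forest. The following is a minimal sketch of both, not the authors' implementation. The function names, the `eps` clipping constant, and the toy density values are hypothetical and chosen only for illustration.

```python
import numpy as np

def negative_log_likelihood(density_values, eps=1e-12):
    """Mean negative log-likelihood of predicted densities at the observed outcomes.

    `eps` is an illustrative floor (an assumption, not from the paper) that
    avoids log(0) when a model assigns zero density to an observation.
    """
    d = np.asarray(density_values, dtype=float)
    return float(-np.mean(np.log(np.maximum(d, eps))))

def forest_density(per_tree_densities):
    """Average the densities (not the log-densities) predicted by each tree.

    Input shape: (n_trees, n_points). Output shape: (n_points,).
    """
    return np.mean(np.asarray(per_tree_densities, dtype=float), axis=0)

# Hypothetical densities that two trees assign to the same three test outcomes.
tree_a = np.array([0.8, 0.1, 0.5])
tree_b = np.array([0.4, 0.6, 0.5])

ensemble = forest_density([tree_a, tree_b])
nll_ensemble = negative_log_likelihood(ensemble)
nll_trees_mean = np.mean([negative_log_likelihood(tree_a),
                          negative_log_likelihood(tree_b)])
```

Because the negative logarithm is convex, Jensen's inequality guarantees that the log-loss of the averaged density is never worse than the average log-loss of the individual trees, which is one reason averaging densities is a natural way to build the ensemble.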

Paper Structure (61 sections, 15 theorems, 133 equations, 6 figures, 8 tables, 1 algorithm)

Key Result

Theorem 3.1

Let $\mathcal{D}_N = \{(X_i, Y_i)\}_{i=1}^N = \{\mathcal{Z}_i\}_{i=1}^N$ be a set of observations belonging to $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ with joint distribution $\mathbb P_{XY} \ll \nu$. Consider $\bar{\mathcal{Y}}_N$ a $\nu$-measurable truncation of $\mathcal{Y}$ as in Equatio Then almost surely.

Figures (6)

  • Figure 1: Illustration of a partition of the joint space $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$. Each leaf of the tree corresponds to a rectangular cell in the induced partition (right). For a fixed query $x=t$, the leaves whose $X$-projection contains $t$ form a histogram over $\mathcal{Y}$: the associated $Y$-intervals $A_Y$ are the bins, and the estimator is constant on each bin. In particular, for $x\le t_1$ the slice $x=t$ intersects two leaves, yielding a two-bin histogram defined by $A_{11}$ and $A_{12}$; for $t_1<x\le t_2$, it intersects the leaves $A_{21}$ and $A_{12}$. On each cell, the conditional density is estimated from empirical counts normalized by the $\mathcal{Y}$-volume $\mu_Y(A_Y)$.
  • Figure 2: Negative log-likelihood of probabilistic tree models on the Concrete Compressive Strength dataset. As the number of noisy features increases, the models exhibit different sensitivity patterns. Shaded bands denote the minimum and maximum values across five cross-validation folds.
  • Figure 3: Performance of probabilistic tree models under increasing label-noise magnitude on the Concrete Compressive Strength dataset. Shaded bands indicate the minimum and maximum negative log-likelihood across five cross-validation folds.
  • Figure 4: Training time comparison on the Physicochemical Protein dataset across different sample sizes. Solid lines indicate mean runtime; shaded regions show the range across five runs. Partition Tree demonstrates consistent computational efficiency and scales favorably compared to both CADET and CDTree.
  • Figure 5: Normalized gain-based feature importances for Partition Tree vs. CADET across regression datasets for the first fold of the cross-validation. Each marker corresponds to a single feature on a given dataset; the dashed line indicates equal importance. The Air Quality dataset is omitted because CADET returned NaN feature importances.
  • ...and 1 more figure
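The histogram reading of Figure 1 can be sketched in a few lines. The example below is a simplified illustration, not the paper's estimator in full generality: it assumes a continuous outcome with Lebesgue measure as $\mu_Y$ (so the $\mathcal{Y}$-volume of a bin is its width) and assumes that all leaves hit by the query share the same $X$-cell, so each bin's probability is its count divided by the cell total. The function names and sample values are hypothetical.

```python
import numpy as np

def leaf_histogram_density(y_in_cell, bin_edges):
    """Piecewise-constant density over Y-bins for the samples in one X-cell.

    Each bin's density is (count / total count) / bin width, i.e. the
    empirical bin probability normalized by the bin's Y-volume.
    """
    y = np.asarray(y_in_cell, dtype=float)
    edges = np.asarray(bin_edges, dtype=float)
    counts, _ = np.histogram(y, bins=edges)
    total = counts.sum()
    if total == 0:
        raise ValueError("no samples fall inside the given bin edges")
    return counts / (total * np.diff(edges))

def evaluate_density(y_query, bin_edges, density):
    """Look up the constant density of the bin containing each query value."""
    edges = np.asarray(bin_edges, dtype=float)
    dens = np.asarray(density, dtype=float)
    y = np.asarray(y_query, dtype=float)
    idx = np.searchsorted(edges, y, side="right") - 1
    # np.histogram closes the last bin on the right, so mirror that here.
    idx = np.where(y == edges[-1], len(dens) - 1, idx)
    inside = (y >= edges[0]) & (y <= edges[-1])
    return np.where(inside, dens[np.clip(idx, 0, len(dens) - 1)], 0.0)

# Hypothetical outcomes observed in one X-cell, with three Y-bins.
y_cell = [0.1, 0.2, 0.3, 0.6, 1.5, 1.8]
edges = [0.0, 0.5, 1.0, 2.0]
density = leaf_histogram_density(y_cell, edges)
values = evaluate_density([0.25, 0.75, 2.0, 3.0], edges, density)
```

Dividing by the bin width is what makes the estimate a density rather than a probability mass: the bin values multiplied by their widths sum to one. For a categorical outcome the same construction would use a counting measure in place of the width.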

Theorems & Definitions (31)

  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof: Proof sketch
  • Corollary 3.2
  • proof
  • Proposition 1.1: Population gain is a Jensen Gap
  • proof
  • Corollary 1.2: Non-negativity of the gain
  • proof
  • ...and 21 more
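The titles of Proposition 1.1 and Corollary 1.2 indicate that the split gain is non-negative. The sketch below illustrates that property for a hypothetical empirical gain of a $Y$-split in a histogram model. It is not taken from the paper, whose exact gain definition is not reproduced here. It assumes a cell with $n$ samples and $\mathcal{Y}$-volume $w$ contributes $n \log(n / w)$ to the log-likelihood, up to a normalizing constant shared by the parent and its children.

```python
import math

def cell_log_likelihood(count, volume):
    """Log-likelihood contribution n * log(n / volume) of one cell.

    The shared normalizing constant is omitted because it cancels when a
    parent is compared with its two children. Empty cells contribute zero.
    """
    if volume <= 0:
        raise ValueError("volume must be positive")
    return 0.0 if count == 0 else count * math.log(count / volume)

def y_split_gain(n_left, w_left, n_right, w_right):
    """Log-likelihood gain from splitting a parent Y-bin into two children."""
    parent = cell_log_likelihood(n_left + n_right, w_left + w_right)
    children = (cell_log_likelihood(n_left, w_left)
                + cell_log_likelihood(n_right, w_right))
    return children - parent

# Hypothetical splits of a parent bin holding 4 samples with total width 2.
gain_uneven = y_split_gain(3, 0.5, 1, 1.5)   # mass concentrated on the left
gain_uniform = y_split_gain(2, 1.0, 2, 1.0)  # children match the parent density
```

Under these assumptions the gain is non-negative by the log-sum inequality, and it is zero exactly when both children have the same density as the parent, so a split only helps when it separates regions of different density.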