Conditional Density Estimation with Histogram Trees

Lincen Yang, Matthijs van Leeuwen

TL;DR

The paper proposes the Conditional Density Tree (CDTree), a fully non-parametric model consisting of a decision tree in which each leaf holds a histogram. Learning is formalized with the minimum description length (MDL) principle, which removes the need to tune a regularization hyperparameter.

Abstract

Conditional density estimation (CDE) goes beyond regression by modeling the full conditional distribution, providing a richer understanding of the data than the conditional mean alone. This makes CDE particularly useful in critical application domains. However, interpretable CDE methods are understudied. Current methods typically employ kernel-based approaches, using kernel functions directly for kernel density estimation or as basis functions in linear models. In contrast, tree-based methods -- which are arguably more comprehensible -- have been largely overlooked for CDE tasks, despite their conceptual simplicity and suitability for visualization. Thus, we propose the Conditional Density Tree (CDTree), a fully non-parametric model consisting of a decision tree in which each leaf is formed by a histogram model. Specifically, we formalize the problem of learning a CDTree using the minimum description length (MDL) principle, which eliminates the need to tune a regularization hyperparameter. Next, we propose an iterative algorithm that, although greedy, searches for the optimal histogram for every possible node split. Our experiments demonstrate that, in comparison to existing interpretable CDE methods, CDTrees are both more accurate (as measured by the log-loss) and more robust against irrelevant features. Further, our approach leads to smaller trees than existing tree-based models, which benefits interpretability.
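
To make the model class concrete, the following is a minimal sketch of a tree whose leaves are histograms, evaluated by log-loss. It is not the authors' algorithm: the single hand-picked split, the equal-width bins, the add-one smoothing, the toy data, and all function names (`fit_histogram`, `histogram_pdf`, `tree_pdf`) are assumptions made for illustration, and no MDL-based selection of splits or bins is performed.

```python
import numpy as np

def fit_histogram(y, n_bins, y_min, y_max):
    """Fit an equal-width histogram density on [y_min, y_max]."""
    edges = np.linspace(y_min, y_max, n_bins + 1)
    counts, _ = np.histogram(y, bins=edges)
    widths = np.diff(edges)
    # Add-one smoothing keeps every bin density positive (an assumption
    # of this sketch, not something taken from the paper).
    density = (counts + 1) / ((counts.sum() + n_bins) * widths)
    return edges, density

def histogram_pdf(y, edges, density):
    """Evaluate the piecewise-constant density at each value in y."""
    idx = np.searchsorted(edges, y, side="right") - 1
    idx = np.clip(idx, 0, len(density) - 1)  # y == y_max goes to the last bin
    return density[idx]

def tree_pdf(x, y, threshold, left, right):
    """One-split tree: route on x <= threshold, then use that leaf's histogram."""
    return np.where(x <= threshold,
                    histogram_pdf(y, *left),
                    histogram_pdf(y, *right))

# Toy data: the distribution of y depends on which side of 0.5 x falls.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, n)
y = np.where(x <= 0.5, rng.beta(2, 8, n), rng.beta(8, 2, n))

left = fit_histogram(y[x <= 0.5], n_bins=10, y_min=0.0, y_max=1.0)
right = fit_histogram(y[x > 0.5], n_bins=10, y_min=0.0, y_max=1.0)
marginal = fit_histogram(y, n_bins=10, y_min=0.0, y_max=1.0)

# Log-loss (mean negative log-likelihood) of conditional vs. unconditional model.
nll_tree = -np.mean(np.log(tree_pdf(x, y, 0.5, left, right)))
nll_marginal = -np.mean(np.log(histogram_pdf(y, *marginal)))
print(f"tree NLL: {nll_tree:.3f}  marginal NLL: {nll_marginal:.3f}")
```

On this toy data the conditional model reaches a lower log-loss than the single unconditional histogram, because each leaf only has to describe the values of y that occur in its region of x.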

Paper Structure

This paper contains 30 sections, 1 theorem, 7 equations, 6 figures, 3 tables, 3 algorithms.

Key Result

Proposition 1

Let $\theta = (\alpha^1, \ldots, \alpha^K)$ be the histogram parameters for histograms on all leaves, and let $\hat{\theta} = \arg\max_\theta P_{M, \theta} (y^n| x^n)$. Then $\int_{y^n} \max_{\theta} P_{M, \theta} (y^n|x^n) = \prod_{k \in [K]} \mathcal{R}(N_k, h_k)$, in which $\mathcal{R}(N_k, h_k)$ is

Figures (6)

  • Figure 1: Three selected leaves from the CDTree modeling the conditional density of the medical costs given demographic features, together with the unconditional density for medical costs.
  • Figure 2: Left: number of leaves for tree-based methods. Right: runtimes of CDTree and kernel-based methods. Note that the y-axes are on a $\log_{10}$ scale.
  • Figure 3: Number of internal nodes with split conditions that contain irrelevant features. The y-axis is square-root scaled for better visualization.
  • Figure 4: Negative log-likelihoods with different numbers of irrelevant 'dependent noisy' features. The results of CDTree (shown in blue lines) are stable on all datasets.
  • Figure 5: Negative log-likelihoods with different numbers of added features, which are generated by independent Gaussian distributions. The results of CDTree, shown in blue solid lines, are extremely stable on all datasets except for the very small 'slump' dataset.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Proposition 1
  • proof