Bayesian Additive Distribution Regression

Antonio R. Linero, Soumyabrata Bose, Jared Murray

TL;DR

The authors argue that shallow decision tree ensembles encode reasonable inductive biases for tabular data, which makes them well suited to distribution regression when the target functional depends primarily on low-dimensional marginals of the distributions.

Abstract

Distribution regression, where the goal is to predict a scalar response from a distribution-valued predictor, arises naturally in settings where observations are grouped and outcomes depend on group-level characteristics rather than on individual measurements. We introduce DistBART, a Bayesian nonparametric approach to distribution regression that models the regression function as a linear functional with the Riesz representer assigned a Bayesian additive regression trees (BART) prior. We argue that shallow decision tree ensembles encode reasonable inductive biases for tabular data, making them appropriate in settings where the functional depends primarily on low-dimensional marginals of the distributions. We show this both empirically on synthetic and real data and theoretically through an adaptive posterior concentration result. We also establish connections to kernel methods, and use this connection to motivate variants of DistBART that can learn nonlinear functionals. To enable scalability to large datasets, we develop a random-feature approximation that samples trees from the BART prior and reduces inference to sparse Bayesian linear regression, achieving computational efficiency while retaining uncertainty quantification.
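The random-feature idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: trees are drawn by a simplified procedure (fixed depth, uniformly chosen split variable and cut point) rather than from the actual BART prior, each group's features are taken to be the proportions of its observations that land in each leaf, and plain ridge regression stands in for the sparse Bayesian linear regression. All function names and the synthetic data are hypothetical.

```python
import numpy as np

DEPTH = 2  # shallow trees; an illustrative choice, not taken from the paper


def sample_tree(rng, dim, depth=DEPTH):
    """Draw a complete axis-aligned tree on [0, 1]^dim (simplified stand-in for a BART prior draw)."""
    n_internal = 2 ** depth - 1
    split_vars = rng.integers(0, dim, size=n_internal)
    cutpoints = rng.uniform(0.0, 1.0, size=n_internal)
    return split_vars, cutpoints


def leaf_index(tree, X, depth=DEPTH):
    """Return the leaf (0 .. 2**depth - 1) that each row of X falls into."""
    split_vars, cutpoints = tree
    rows = np.arange(X.shape[0])
    node = np.zeros(X.shape[0], dtype=int)  # heap order: children of i are 2i+1 and 2i+2
    for _ in range(depth):
        go_right = X[rows, split_vars[node]] > cutpoints[node]
        node = 2 * node + 1 + go_right.astype(int)
    return node - (2 ** depth - 1)


def group_features(trees, X, depth=DEPTH):
    """Map one group's sample to leaf-occupancy proportions, concatenated over trees."""
    n_leaves = 2 ** depth
    blocks = [
        np.bincount(leaf_index(tree, X, depth), minlength=n_leaves) / X.shape[0]
        for tree in trees
    ]
    return np.concatenate(blocks)


def ridge_fit(Phi, y, lam=1e-2):
    """Ridge regression, used here in place of a sparse Bayesian linear model."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)


rng = np.random.default_rng(0)
dim, n_trees, n_groups, n_per_group = 3, 50, 200, 100
trees = [sample_tree(rng, dim) for _ in range(n_trees)]

# Synthetic groups: every coordinate of group i is Beta(a_ij, 2) distributed.
shapes = rng.uniform(0.5, 3.0, size=(n_groups, dim))
groups = [rng.beta(a, 2.0, size=(n_per_group, dim)) for a in shapes]

# The response is a linear functional of the group distribution: the mean of coordinate 0.
y = shapes[:, 0] / (shapes[:, 0] + 2.0) + 0.01 * rng.normal(size=n_groups)

Phi = np.vstack([group_features(trees, X) for X in groups])
beta = ridge_fit(Phi[:150], y[:150])
rmse = np.sqrt(np.mean((Phi[150:] @ beta - y[150:]) ** 2))
print(f"held-out RMSE: {rmse:.3f}")
```

Because the trees are sampled once and then held fixed, the only unknowns are the regression coefficients, which is what reduces the problem to linear regression on tree-derived features. No separate intercept is needed in this sketch, since each tree's block of leaf proportions sums to one.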

Paper Structure

This paper contains 39 sections, 3 theorems, 48 equations, 4 figures, and 2 algorithms.

Key Result

Theorem 1

Under the DistBART model, we have $[f \mid \kappa] \sim \operatorname{GP}(0, \sigma^2_\mu \mathcal{K})$ where $\mathcal{K}(G, Q) = \iint \kappa(x,x') \, G(dx) \, Q(dx') = \langle \phi_G, \phi_Q \rangle_{\mathcal{H}_\kappa}$ and $\sigma^2_\mu / T$ is the prior variance of $\mu_{t\ell}$. The posterior me
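
The identity between the double integral and the inner product in Theorem 1 can be checked with a short derivation. This is a sketch under two assumptions not stated in the excerpt: that $\phi_G$ denotes the kernel mean embedding $\phi_G = \int \kappa(\cdot, x) \, G(dx)$ in the reproducing kernel Hilbert space $\mathcal{H}_\kappa$, and that the integrals are finite so the inner product can be exchanged with integration. The last step uses the reproducing property $\langle \kappa(\cdot, x), \kappa(\cdot, x') \rangle_{\mathcal{H}_\kappa} = \kappa(x, x')$.

```latex
\begin{aligned}
\langle \phi_G, \phi_Q \rangle_{\mathcal{H}_\kappa}
&= \left\langle \int \kappa(\cdot, x) \, G(dx), \int \kappa(\cdot, x') \, Q(dx') \right\rangle_{\mathcal{H}_\kappa} \\
&= \iint \langle \kappa(\cdot, x), \kappa(\cdot, x') \rangle_{\mathcal{H}_\kappa} \, G(dx) \, Q(dx') \\
&= \iint \kappa(x, x') \, G(dx) \, Q(dx') = \mathcal{K}(G, Q).
\end{aligned}
```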

Figures (4)

  • Figure 1: Mapping a spherical Gaussian distribution truncated to $[0,1]^2$, $G_i$, to a feature vector $\phi_i$ using trees in the special case where $T = 1$.
  • Figure 2: Mean test RMSE for four methods: BART-based features, RBF-based features, both BART and RBF-based features, and mean features, aggregated using glmnet for the simulation experiment described in the synthetic-data section.
  • Figure 3: Performance comparison of distribution regression on the voting dataset, based on 30 repeated 80-20 train/test splits.
  • Figure 4: Posterior summaries of the estimated linear functional from the voting analysis. Top: posterior mean and 95% credible bands for additive summaries of $\psi(x)$ over continuous (left) and categorical (right) covariates. Bottom left: reduction in test-set $R^2$ when each feature is omitted. Bottom right: posterior mean of feature coefficients from a horseshoe regression fit on the tree-derived features. Codes for categorical features are given in the Supplementary Material.

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Lemma 1