On the Computational Efficiency of Bayesian Additive Regression Trees: An Asymptotic Analysis

Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu

TL;DR

The paper investigates the computational efficiency of Bayesian Additive Regression Trees (BART) by analyzing a modified BART MCMC sampler under discrete covariates. It develops a theoretical framework that marginalizes leaf parameters to study a finite-state Markov chain on tree-structure ensembles, establishing hitting-time lower bounds that grow with the training size n in additive and interaction settings, due to posterior multimodality. It further derives mixing-time upper bounds for three practical remedies—increasing the number of trees, applying more global moves, and tempering the posterior—showing these strategies can keep convergence times effectively constant in n or grow subpolynomially. Complemented by simulations on six datasets, the results corroborate the theoretical claims: the default BART sampler mixes more slowly as data grow, but the proposed modifications improve convergence and uncertainty quantification without compromising the HPDR target. Overall, the work links BIC-based posterior concentration, HPDR sets, and hitting-time theory to provide guidance for improving BART’s computational efficiency in large-scale settings, with publicly available code for replication and further exploration.

Abstract

Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory, comprising guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data-generating settings and for appropriate prior choices. However, the computational properties of the widely used BART sampler proposed by Chipman et al. (2010) are yet to be well understood. In this paper, we perform an asymptotic analysis of a slightly modified version of the default BART sampler when fitted to data-generating processes with discrete covariates. We show that the sampler's time to convergence, evaluated in terms of the hitting time of a high posterior density set, increases with the number of training samples, due to the multi-modal nature of the target posterior. On the other hand, we show that this trend can be dampened by simple changes, such as increasing the number of trees in the ensemble or raising the temperature of the sampler. These results provide a nuanced picture of the computational efficiency of the BART sampler in the presence of large amounts of training data while suggesting strategies to improve the sampler. We complement our theoretical analysis with a simulation study focusing on the default BART sampler. We observe that the increasing trend of convergence time against the number of training samples holds for the default BART sampler and is robust to changes in sampler initialization, number of burn-in iterations, feature selection prior, and discretization strategy. On the other hand, increasing the number of trees or raising the temperature sharply dampens this trend, as indicated by our theory.
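The abstract's central claim, that hitting times of a high posterior density set grow with the number of training samples and that raising the temperature dampens this, can be illustrated on a toy chain. The sketch below is not the BART sampler: it assumes a five-state path with nearest-neighbour Metropolis proposals, a hypothetical per-sample log-posterior profile `gap` with a local mode at state 0 and a global mode at state 4, and log-posterior gaps that scale linearly with n. The function name `expected_hitting_time` is likewise illustrative. Expected hitting times are computed exactly by solving a linear system rather than by simulation.

```python
import numpy as np

def expected_hitting_time(log_post, start, target, temperature=1.0):
    """Exact expected number of Metropolis steps to reach `target` from `start`
    on a path graph with nearest-neighbour proposals (toy model, not BART)."""
    lp = np.asarray(log_post, dtype=float) / temperature  # tempered log target
    k = lp.size
    P = np.zeros((k, k))
    for i in range(k):
        for j in (i - 1, i + 1):
            if 0 <= j < k:
                # Propose each neighbour with probability 1/2, accept via Metropolis.
                P[i, j] = 0.5 * min(1.0, np.exp(lp[j] - lp[i]))
        P[i, i] = 1.0 - P[i].sum()  # rejected or out-of-range proposals stay put
    # Hitting times h solve (I - Q) h = 1, with Q the chain restricted to non-target states.
    keep = [i for i in range(k) if i != target]
    Q = P[np.ix_(keep, keep)]
    h = np.linalg.solve(np.eye(k - 1) - Q, np.ones(k - 1))
    return float(h[keep.index(start)])

# Hypothetical per-sample log-posterior profile: local mode at 0, barrier at 2, global mode at 4.
gap = np.array([-0.1, -0.6, -1.0, -0.6, 0.0])

times = {
    (n, T): expected_hitting_time(n * gap, start=0, target=4, temperature=T)
    for n in (5, 10, 20)
    for T in (1.0, 2.0)
}
for (n, T), t in sorted(times.items()):
    print(f"n={n:>2}  T={T:.0f}  expected hitting time = {t:.3g}")
```

In this toy setting the barrier between the two modes grows with n, so the expected hitting time at temperature 1 grows rapidly, while dividing the log posterior by a larger temperature lowers the barrier. The sketch says nothing about which distribution the tempered chain ultimately samples from.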

Paper Structure (88 sections, 45 theorems, 243 equations, 15 figures, 1 table, 1 algorithm)

Key Result

Proposition 4.1

Consider two TSEs $\mathfrak{E}$ and $\mathfrak{E}'$ and denote the difference in their BIC values as $\Delta \operatorname{BIC}(\mathfrak{E},\mathfrak{E}') \coloneqq \operatorname{BIC}(\mathfrak{E}) - \operatorname{BIC}(\mathfrak{E}')$. We have … If, furthermore, both TSEs have the same bias, i.e. $\Pi_{\mathfrak{E}}[f^*] = \Pi_{\mathfrak{E}'}[f^*]$, then we have …

Figures (15)

  • Figure 1: The Bayesian CART model is not identifiable at the level of tree structures. The two tree structures shown on the left both realize the same partition of the covariate space and even the same regression function, which is shown on the right.
  • Figure 2: Visual illustration of Proposition 6.1. The chain is initialized at $\mathfrak{E}_0$ and with positive probability hits a suboptimal TSE $\mathfrak{E}_{\text{bad}}$ before $\operatorname{OPT}_{m}(f^*,k)$. This causes the chain to get stuck, as it can only reach $\operatorname{OPT}_{m}(f^*,k)$ by passing through an "impassable" barrier set $\mathcal{B}$.
  • Figure 3: Visual illustration of the construction for $\mathfrak{E}_{\text{bad}}$ used in the proof of Theorem 5.1. The left panel displays the function $f_1(x_1) + f_2(x_2)$ together with all knots of $f_1$ and $f_2$. We define $\mathfrak{E}_{\text{bad}}$ so that its first two trees $\mathfrak{T}_1$ and $\mathfrak{T}_2$ induce the partitions $\mathbb{V}_1$ and $\mathbb{V}_2$ respectively. These combine for a total of 13 leaves. On the other hand, an optimal TSE will instead make use of $\mathbb{V}_1^*$ and $\mathbb{V}_2^*$, which combine for a total of only 8 leaves. Nonetheless, we still have $f_1 + f_2 \in \mathbb{V}_1^* + \mathbb{V}_2^*$.
  • Figure 4: Values for Gelman-Rubin $\hat{R}$ (left), coverage (center), and RMSE (right) for the BART sampler under different fixed temperatures ($T \in \lbrace1,2,3\rbrace$) as well as a linear temperature schedule. Results are plotted for the California Housing (top), Low-Dimensional Smooth (middle), and Echo Months (bottom) datasets. Error bars represent $\pm 1.96$ standard errors from 25 replicates.
  • Figure 5: Values for Gelman-Rubin $\hat{R}$ for the BART sampler under different fixed temperatures ($T \in \lbrace1,2,3\rbrace$) when $\hat{R}$ is computed with respect to the 0.05, 0.25, 0.50, 0.75, and 0.95 quantiles for the fitted responses on the held-out test set. Results are plotted for the California Housing (left), Low-Dimensional Smooth (center), and Echo Months (right) datasets.
  • ...and 10 more figures
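
Figures 4 and 5 report the Gelman-Rubin $\hat{R}$ diagnostic. As a reminder of what that quantity measures, the following is a minimal sketch of the classic (non-split) $\hat{R}$ for a scalar summary tracked across several chains. It is not the paper's evaluation code: the function name `gelman_rubin`, the synthetic chains, and the choice of scalar summary are all assumptions made for illustration.

```python
import numpy as np

def gelman_rubin(chains):
    """Classic Gelman-Rubin R-hat for an array of shape (num_chains, num_draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled estimate of the posterior variance
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(0)

# Four synthetic chains drawn from the same distribution: R-hat should be close to 1.
mixed = rng.normal(size=(4, 1000))
rhat_mixed = gelman_rubin(mixed)

# The same chains shifted to different locations, mimicking chains stuck in separate modes.
stuck = mixed + np.array([0.0, 3.0, 6.0, 9.0])[:, None]
rhat_stuck = gelman_rubin(stuck)

print(f"R-hat, well-mixed chains: {rhat_mixed:.3f}")
print(f"R-hat, stuck chains:      {rhat_stuck:.3f}")
```

Values near 1 indicate that between-chain and within-chain variability agree, while values well above 1 indicate that the chains have not converged to a common distribution.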

Theorems & Definitions (86)

  • Proposition 4.1: Concentration of BIC differences
  • Proposition 4.2: Log marginal likelihood and BIC
  • Proposition 4.3: BART posterior concentration
  • Theorem 5.1: Lower bounds for additive model
  • Theorem 5.2: Lower bound for pure interaction
  • Theorem 5.3: Lower bound for Bayesian CART with root dependence
  • Remark 5.4
  • Remark 5.5: Relationship between hitting times and mixing times
  • Proposition 6.1: Recipe for hitting time lower bounds
  • Theorem 7.1: Upper bound from increasing number of trees
  • ...and 76 more