On the Computational Efficiency of Bayesian Additive Regression Trees: An Asymptotic Analysis

Yan Shuo Tan; Omer Ronen; Theo Saarinen; Bin Yu

TL;DR

The paper investigates the computational efficiency of Bayesian Additive Regression Trees (BART) by analyzing a modified BART MCMC sampler under discrete covariates. It develops a theoretical framework that marginalizes leaf parameters to study a finite-state Markov chain on tree-structure ensembles, establishing hitting-time lower bounds that grow with the training size n in additive and interaction settings, due to posterior multimodality. It further derives mixing-time upper bounds for three practical remedies—increasing the number of trees, applying more global moves, and tempering the posterior—showing these strategies can keep convergence times effectively constant in n or grow subpolynomially. Complemented by simulations on six datasets, the results corroborate the theoretical claims: the default BART sampler mixes more slowly as data grow, but the proposed modifications improve convergence and uncertainty quantification without compromising the HPDR target. Overall, the work links BIC-based posterior concentration, HPDR sets, and hitting-time theory to provide guidance for improving BART’s computational efficiency in large-scale settings, with publicly available code for replication and further exploration.

Abstract

Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory, comprising guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. However, the computational properties of the widely-used BART sampler proposed by Chipman et al. (2010) are not yet well understood. In this paper, we perform an asymptotic analysis of a slightly modified version of the default BART sampler when fitted to data-generating processes with discrete covariates. We show that the sampler's time to convergence, evaluated in terms of the hitting time of a high posterior density set, increases with the number of training samples, due to the multi-modal nature of the target posterior. On the other hand, we show that this trend can be dampened by simple changes, such as increasing the number of trees in the ensemble or raising the temperature of the sampler. These results provide a nuanced picture of the computational efficiency of the BART sampler in the presence of large amounts of training data while suggesting strategies to improve the sampler. We complement our theoretical analysis with a simulation study focusing on the default BART sampler. We observe that the increasing trend of convergence time against the number of training samples holds for the default BART sampler and is robust to changes in sampler initialization, number of burn-in iterations, feature selection prior, and discretization strategy. On the other hand, increasing the number of trees or raising the temperature sharply dampens this trend, as indicated by our theory.
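The mechanism described in the abstract (hitting times that grow with the number of training samples, and that are dampened by raising the temperature) can be illustrated with a toy calculation. The sketch below is not the BART sampler or the paper's construction. It assumes a hypothetical three-state Metropolis chain (bad mode, barrier, optimum) in which the log-posterior drop from the bad mode to the barrier is `gap * log(n)`, and in which tempering raises the posterior to the power `1/temperature`. The function name and the parameters `gap` and `temperature` are illustrative choices.

```python
import math


def expected_hitting_time(n, temperature=1.0, gap=1.0):
    """Expected steps to reach the optimum from the bad mode in a toy chain.

    Toy model (an assumption for illustration, not the paper's chain):
      states:  bad -- barrier -- opt   (a path graph)
      from bad: propose barrier w.p. 1/2, otherwise stay put
      from barrier: propose bad or opt w.p. 1/2 each (both uphill, always accepted)
      log posterior(barrier) - log posterior(bad) = -gap * log(n)
      tempering divides log-posterior differences by `temperature`
    """
    # Metropolis acceptance probability for the downhill move bad -> barrier.
    accept = min(1.0, math.exp(-gap * math.log(n) / temperature))
    p = 0.5 * accept  # probability of actually leaving the bad mode in one step

    # First-step analysis, with h_bad and h_bar the expected hitting times:
    #   h_bar = 1 + 0.5 * h_bad
    #   h_bad = 1 + (1 - p) * h_bad + p * h_bar
    # Solving the two equations gives h_bad = 2 / p + 2.
    return 2.0 / p + 2.0


if __name__ == "__main__":
    for n in (100, 10_000):
        for temperature in (1.0, 2.0):
            steps = expected_hitting_time(n, temperature)
            print(f"n={n:>6}  T={temperature}  expected hitting time={steps:.1f}")
```

In this toy chain the expected hitting time equals 4 n^(gap/T) + 2, so it grows linearly in n at temperature 1 and only like the square root of n at temperature 2. The growth rates in the paper depend on its specific assumptions and are not reproduced by this sketch.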

Paper Structure (88 sections, 45 theorems, 243 equations, 15 figures, 1 table, 1 algorithm)

Introduction
The rise of BART
Observed poor mixing of BART and its significance
BART MCMC and prior theoretical work
Main contributions
Data generation models for BART and for frequentist analysis
Generative model for frequentist analysis
Bayesian model specification for BART
Regression trees
Sum-of-trees model
Priors
Differences with in-practice BART
Sampling from BART via MCMC
The in-practice BART sampler
The analyzed BART sampler
...and 73 more sections

Key Result

Proposition 4.1

Consider two TSEs $\mathfrak{E}$ and $\mathfrak{E}'$ and denote the difference in their BIC values as $\Delta \operatorname{BIC}(\mathfrak{E},\mathfrak{E}') \coloneqq \operatorname{BIC}(\mathfrak{E}) - \operatorname{BIC}(\mathfrak{E}')$. We have [displayed bound not recovered]. If, furthermore, both TSEs have the same bias, i.e. $\Pi_{\mathfrak{E}}[f^*] = \Pi_{\mathfrak{E}'}[f^*]$, then we have [displayed bound not recovered].

Figures (15)

Figure 1: The Bayesian CART model is not identifiable at the level of tree structures. The two tree structures shown on the left both realize the same partition of the covariate space and even the same regression function, which is shown on the right.
Figure 2: Visual illustration of Proposition 6.1. The chain is initialized at $\mathfrak{E}_0$ and with positive probability hits a suboptimal TSE $\mathfrak{E}_{\text{bad}}$ before $\operatorname{OPT}_{m}(f^*,k)$. This causes the chain to get stuck, as it can only reach $\operatorname{OPT}_{m}(f^*,k)$ by passing through an "impassable" barrier set $\mathcal{B}$.
Figure 3: Visual illustration of the construction for $\mathfrak{E}_{\text{bad}}$ used in the proof of Theorem 5.1. The left panel displays the function $f_1(x_1) + f_2(x_2)$ together with all knots of $f_1$ and $f_2$. We define $\mathfrak{E}_{\text{bad}}$ so that its first two trees $\mathfrak{T}_1$ and $\mathfrak{T}_2$ induce the partitions $\mathbb{V}_1$ and $\mathbb{V}_2$ respectively. These combine for a total of 13 leaves. On the other hand, an optimal TSE will instead make use of $\mathbb{V}_1^*$ and $\mathbb{V}_2^*$, which combine for a total of only 8 leaves. Nonetheless, we still have $f_1 + f_2 \in \mathbb{V}_1^* + \mathbb{V}_2^*$.
Figure 4: Values for Gelman-Rubin $\hat{R}$ (left), coverage (center), and RMSE (right) for the BART sampler under different fixed temperatures ($T \in \lbrace1,2,3\rbrace$) as well as a linear temperature schedule. Results are plotted for the California Housing (top), Low-Dimensional Smooth (middle), and Echo Months (bottom) datasets. Error bars represent $\pm 1.96$ standard errors from 25 replicates.
Figure 5: Values for Gelman-Rubin $\hat{R}$ for the BART sampler under different fixed temperatures ($T \in \lbrace1,2,3\rbrace$) when $\hat{R}$ is computed with respect to the 0.05, 0.25, 0.50, 0.75, and 0.95 quantiles for the fitted responses on the held-out test set. Results are plotted for the California Housing (left), Low Dimensional Smooth (center), and Echo Months (right) datasets.
...and 10 more figures
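
The Gelman-Rubin $\hat{R}$ diagnostic reported in Figures 4 and 5 compares between-chain and within-chain variance. A minimal sketch of the classical (non-split) version is given below, assuming each chain has already been reduced to a list of scalar draws (for example, one summary of the fitted responses per MCMC iteration). The function name is illustrative, and the paper's exact variant of the statistic may differ.

```python
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classical Gelman-Rubin R-hat for several equal-length chains of scalars.

    `chains` is a list of lists; each inner list holds the post-burn-in draws
    of one chain. Values close to 1 suggest the chains agree with each other.
    """
    num_chains = len(chains)
    draws = len(chains[0])
    if num_chains < 2 or draws < 2:
        raise ValueError("need at least two chains with at least two draws each")
    if any(len(chain) != draws for chain in chains):
        raise ValueError("all chains must have the same length")

    # W: average of the within-chain sample variances.
    within = mean([variance(chain) for chain in chains])
    if within == 0:
        raise ValueError("within-chain variance is zero; R-hat is undefined")

    # B: between-chain variance, scaled by the number of draws per chain.
    between = draws * variance([mean(chain) for chain in chains])

    # Pooled estimate of the posterior variance, then the ratio to W.
    pooled = (draws - 1) / draws * within + between / draws
    return (pooled / within) ** 0.5


if __name__ == "__main__":
    agreeing = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
    separated = [[0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0]]
    print(gelman_rubin_rhat(agreeing))
    print(gelman_rubin_rhat(separated))
```

Chains stuck in different posterior modes give a large between-chain variance and therefore an $\hat{R}$ well above 1, which is why the diagnostic is a natural proxy for the multimodality-driven slow mixing discussed in the paper.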

Theorems & Definitions (86)

Proposition 4.1: Concentration of BIC differences
Proposition 4.2: Log marginal likelihood and BIC
Proposition 4.3: BART posterior concentration
Theorem 5.1: Lower bounds for additive model
Theorem 5.2: Lower bound for pure interaction
Theorem 5.3: Lower bound for Bayesian CART with root dependence
Remark 5.4
Remark 5.5: Relationship between hitting times and mixing times
Proposition 6.1: Recipe for hitting time lower bounds
Theorem 7.1: Upper bound from increasing number of trees
...and 76 more
