On the Computational Efficiency of Bayesian Additive Regression Trees: An Asymptotic Analysis
Yan Shuo Tan, Omer Ronen, Theo Saarinen, Bin Yu
TL;DR
The paper investigates the computational efficiency of Bayesian Additive Regression Trees (BART) by analyzing a modified BART MCMC sampler under discrete covariates. It develops a theoretical framework that marginalizes leaf parameters to study a finite-state Markov chain on tree-structure ensembles, establishing hitting-time lower bounds that grow with the training size n in additive and interaction settings, due to posterior multimodality. It further derives mixing-time upper bounds for three practical remedies (increasing the number of trees, applying more global moves, and tempering the posterior), showing these strategies can keep convergence times effectively constant in n or grow subpolynomially. Complemented by simulations on six datasets, the results corroborate the theoretical claims: the default BART sampler mixes more slowly as data grow, but the proposed modifications improve convergence and uncertainty quantification without compromising the HPDR target. Overall, the work links BIC-based posterior concentration, HPDR sets, and hitting-time theory to provide guidance for improving BART's computational efficiency in large-scale settings, with publicly available code for replication and further exploration.
Abstract
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by well-developed estimation theory, comprising guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data-generating settings and for appropriate prior choices. However, the computational properties of the widely used BART sampler proposed by Chipman et al. (2010) are not yet well understood. In this paper, we perform an asymptotic analysis of a slightly modified version of the default BART sampler when fitted to data-generating processes with discrete covariates. We show that the sampler's time to convergence, evaluated in terms of the hitting time of a high posterior density set, increases with the number of training samples, due to the multi-modal nature of the target posterior. On the other hand, we show that this trend can be dampened by simple changes, such as increasing the number of trees in the ensemble or raising the temperature of the sampler. These results provide a nuanced picture of the computational efficiency of the BART sampler in the presence of large amounts of training data while suggesting strategies to improve the sampler. We complement our theoretical analysis with a simulation study focusing on the default BART sampler. We observe that the increasing trend of convergence time against the number of training samples holds for the default BART sampler and is robust to changes in sampler initialization, number of burn-in iterations, feature selection prior, and discretization strategy. On the other hand, increasing the number of trees or raising the temperature sharply dampens this trend, as indicated by our theory.
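To make the hitting-time intuition concrete, the following is a minimal toy sketch, not the BART sampler and not the paper's construction. It assumes a hypothetical five-state chain with a two-mode target pi(x) proportional to exp(-n * energy[x] / temperature), so that gaps in log-density scale with the sample size n, and computes exactly the expected number of nearest-neighbour Metropolis steps needed to travel from one mode to the other. The function name, the energy values, and the chain itself are illustrative assumptions.

```python
import math


def expected_hitting_time(n, temperature=1.0, energy=(0.0, 0.5, 1.0, 0.5, 0.0)):
    """Expected Metropolis steps to travel from state 0 to the last state.

    Toy assumption: the target is proportional to
    exp(-n * energy[x] / temperature), so both ends are modes
    separated by a barrier in the middle.
    """
    beta = n / temperature
    last = len(energy) - 1

    def accept(x, y):
        # Metropolis acceptance probability for a move from x to y.
        delta = energy[y] - energy[x]
        return 1.0 if delta <= 0 else math.exp(-beta * delta)

    total = 0.0
    h_prev = 0.0  # expected time to move from state i-1 to state i
    for i in range(last):
        # Propose up or down with probability 1/2 each; at state 0 the
        # downward proposal is rejected, so the chain stays put.
        p_up = 0.5 * accept(i, i + 1)
        p_down = 0.5 * accept(i, i - 1) if i > 0 else 0.0
        # Birth-death recursion: h_i * p_up = 1 + p_down * h_{i-1}.
        h = (1.0 + p_down * h_prev) / p_up
        total += h
        h_prev = h
    return total


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        untempered = expected_hitting_time(n)
        tempered = expected_hitting_time(n, temperature=n)
        print(n, untempered, tempered)
```

In this toy, the untempered hitting time grows with n because the barrier between the modes deepens, while setting the temperature proportional to n keeps it constant. This mirrors only the qualitative trend described above; the actual BART analysis concerns a Markov chain on tree-structure ensembles.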
