StochTree: BART-based modeling in R and Python
Andrew Herren, P. Richard Hahn, Jared Murray, Carlos Carvalho
TL;DR
stochtree unifies BART-based modeling in R and Python through a shared C++ core that powers interoperable bindings for both languages. It extends classic BART with BCF, random effects, heteroskedastic forests, and leaf-wise linear models. The paper describes the BART prior and the extended feature set, then lays out a practical workflow in three parts (data preprocessing, prior specification, and algorithm settings), followed by prediction, diagnostics, and serialization; it demonstrates the workflow on a Friedman dataset example. By exposing low-level interfaces and supporting cross-language model serialization, stochtree lets users prototype novel Bayesian tree ensembles quickly and pass fitted models between R and Python. The work emphasizes extensibility and computational efficiency, aiming to close the gap between research innovations in BART and their use in applied settings.
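For readers unfamiliar with the Friedman example, the following is a minimal sketch of how such data are typically simulated, assuming the "Friedman dataset" refers to the standard benchmark function of Friedman (1991). The sample size, number of features, noise level, and the function names `friedman_mean` and `make_friedman` are illustrative assumptions, not values or identifiers taken from the paper, and the sketch does not use the stochtree API.

```python
import math
import random


def friedman_mean(x):
    """Standard Friedman (1991) benchmark mean; only the first five features matter."""
    return (
        10.0 * math.sin(math.pi * x[0] * x[1])
        + 20.0 * (x[2] - 0.5) ** 2
        + 10.0 * x[3]
        + 5.0 * x[4]
    )


def make_friedman(n, p=10, noise_sd=1.0, seed=0):
    """Simulate n rows with p Uniform(0, 1) features and Gaussian noise.

    Hypothetical helper: n, p, noise_sd, and seed are illustrative choices.
    """
    if p < 5:
        raise ValueError("the Friedman function needs at least 5 features")
    rng = random.Random(seed)
    # Features beyond the fifth are pure noise variables.
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [friedman_mean(row) + rng.gauss(0.0, noise_sd) for row in X]
    return X, y


# Example draw: 500 observations, 10 features (5 signal, 5 noise).
X, y = make_friedman(n=500, p=10, noise_sd=1.0, seed=42)
```

Data of this kind are a common test bed for BART because the mean function mixes an interaction, a nonlinear term, and linear terms, while the remaining features carry no signal.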
Abstract
stochtree is a C++ library for Bayesian tree ensemble models such as BART and Bayesian Causal Forests (BCF), as well as user-specified variations. Unlike previous BART packages, stochtree provides bindings to both R and Python for full interoperability. stochtree boasts a more comprehensive range of models relative to previous packages, including heteroskedastic forests, random effects, and treed linear models. Additionally, stochtree offers flexible handling of model fits: the ability to save model fits, reinitialize models from existing fits (facilitating improved model initialization heuristics), and pass fits between R and Python. On both platforms, stochtree exposes lower-level functionality, allowing users to specify models incorporating Bayesian tree ensembles without needing to modify C++ code. We illustrate the use of stochtree in three settings: i) straightforward applications of existing models such as BART and BCF, ii) models that include more sophisticated components like heteroskedasticity and leaf-wise regression models, and iii) as a component of custom MCMC routines to fit nonstandard tree ensemble models.
