Co-data Learning for Bayesian Additive Regression Trees

Jeroen M. Goedhart, Thomas Klausch, Jurriaan Janssen, Mark A. van de Wiel

TL;DR

An empirical Bayes (EB) framework is developed that estimates, assisted by a co-data model, prior covariate weights in the BART model and enhances prediction in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data.

Abstract

Medical prediction applications often need to deal with small sample sizes compared to the number of covariates. Such data pose problems for prediction and variable selection, especially when the covariate-response relationship is complicated. To address these challenges, we propose to incorporate co-data, i.e. external information on the covariates, into Bayesian additive regression trees (BART), a sum-of-trees prediction model that utilizes priors on the tree parameters to prevent overfitting. To incorporate co-data, an empirical Bayes (EB) framework is developed that estimates, assisted by a co-data model, prior covariate weights in the BART model. The proposed method can handle multiple types of co-data simultaneously. Furthermore, the proposed EB framework enables the estimation of the other hyperparameters of BART as well, rendering an appealing alternative to cross-validation. We show that the method finds relevant covariates and that it improves prediction compared to default BART in simulations. If the covariate-response relationship is nonlinear, the method benefits from the flexibility of BART to outperform regression-based co-data learners. Finally, the use of co-data enhances prediction in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data.

Keywords: Bayesian additive regression trees; Empirical Bayes; Co-data; High-dimensional data; Omics; Prediction

Paper Structure

This paper contains 22 sections, 14 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Boxplots of the co-data moderated EB estimates of the covariate weights of EB-coBART across the $500$ simulated data sets for different simulation settings. Panel (a) shows results for $G=5$ and panel (b) for $G=20$, for which only the first $10$ groups are depicted for visualization. For each group, the left (blue) boxplot corresponds to the rigid tree setting ($\alpha=0.1$, $\beta=4$, $k=1$) and the right (green) boxplot to the flexible one ($\alpha=0.95$, $\beta=2$, $k=2$). Outliers are not shown. The horizontal dotted lines correspond to equal group weights ($0.2$ for $G=5$; $0.05$ for $G=20$).
  • Figure 2: Boxplot of the ratio $\textrm{PMSE}_{\textrm{EBcoBART}}/\textrm{PMSE}_{\textrm{BART}}$ across the $500$ simulated data sets for both the rigid tree models (blue, left) and the flexible tree models (green, right). The four panels correspond to different simulation settings.
  • Figure 3: (a) Learning curves for EB-coBART 2 (triangles) and BART (dots). (b) Estimated WAIC (dots) for $18$ iterations, with the minimum indicated by the dashed vertical line at iteration $9$. (c) Estimated cumulative covariate weights for the four types of covariates as a function of external p-values on the $-\textrm{logit}$ scale for EB-coBART 2. Types of covariates are indicated by a square (copy number variation), a triangle (mutation), a diamond (translocation), and a circle (IPI). (d) Partial dependence plots of EB-coBART 2 (triangles) and BART (dots) showing the marginal effect of IPI on the predictions. On the y-axis, we show the latent response values, i.e. $Z$-values of the standard normal cdf, because BART models use a probit link for binary responses. We show the average $\pm$ the standard deviation across the Gibbs samples of the latent response.
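
The reference quantities used in the figure captions can be illustrated with a minimal Python sketch. It is not taken from the paper's code: all function names are hypothetical, and it assumes that PMSE denotes the mean squared prediction error on held-out data. It shows the equal-weight baseline $1/G$ (the dotted lines in Figure 1), the PMSE ratio plotted in Figure 2, and the probit mapping from a latent $Z$-value to a probability that underlies the y-axis of Figure 3(d).

```python
from statistics import NormalDist


def equal_group_weight(num_groups: int) -> float:
    """Weight each of G groups gets when no group is favoured (1/G)."""
    return 1.0 / num_groups


def pmse(y_true, y_pred) -> float:
    """Mean squared prediction error (assumed meaning of PMSE)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pmse_ratio(y_true, pred_cobart, pred_bart) -> float:
    """Ratio PMSE_EBcoBART / PMSE_BART; values below 1 favour EB-coBART."""
    return pmse(y_true, pred_cobart) / pmse(y_true, pred_bart)


def probit_probability(z: float) -> float:
    """Map a latent Z-value to a probability via the standard normal cdf."""
    return NormalDist().cdf(z)


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the paper.
    print(equal_group_weight(5), equal_group_weight(20))
    print(pmse_ratio([1.0, 2.0], [1.0, 2.5], [0.0, 2.0]))
    print(probit_probability(0.0))
```

Under these assumptions, a boxplot in Figure 2 lying below $1$ indicates that EB-coBART predicts better than default BART in that setting, and a group weight above $1/G$ in Figure 1 indicates that the group is upweighted relative to the uniform baseline.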