Tree-Based Predictive Models for Noisy Input Data

Kevin McCoy, Zachary Wooten, Christine B. Peterson

TL;DR

This work proposes measurement error BART (meBART), a novel extension to the BART model that directly incorporates measurement error in the independent variable(s). It demonstrates the utility of the approach through two biomedical applications in which the predictors of interest are subject to measurement error.

Abstract

Measurement error is prevalent across all domains of scientific research where only imprecise observations, rather than the true underlying values, can be obtained. For example, estimates of human microbiome diversity are based on small samples from a much larger, generally unobserved system and reflect both sampling error and technical variation. In high-noise settings like these, it becomes difficult to make accurate predictions and to summarize uncertainty. Methods have previously been proposed to accommodate measurement error in classic predictive models, such as linear regression. However, relatively little work has been done to address measurement error in more complex and flexible models. Bayesian additive regression trees (BART), a Bayesian nonparametric model that sums the output of many decision trees, offers robust predictions with built-in uncertainty quantification. In this work, we propose measurement error BART (meBART), a novel extension to the BART model that directly incorporates measurement error in the independent variable(s). Through simulation studies, we show that in the presence of measurement error, our model enables more accurate parameter estimation, more robust uncertainty quantification, and superior predictive performance. We illustrate the utility of our proposed approach through two biomedical applications where the predictors of interest are subject to measurement error.
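
As a small illustration of why noisy predictors matter even in the classic linear regression case mentioned above, the following sketch simulates additive measurement error and compares the fitted slope with and without it. This is not meBART or any method from the paper. The function name, the sample size, the slope, and the noise levels are all assumptions chosen for illustration, and the true predictor is assumed to be standard normal.

```python
import numpy as np


def simulate_attenuation(n=5000, slope=2.0, sigma_y=0.1, sigma_u=0.5, seed=0):
    """Fit a line on the true and on the noisy predictor (illustrative only).

    All parameter values are hypothetical; they are not taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # True predictor (assumed standard normal) and the noisy response.
    x_true = rng.normal(0.0, 1.0, size=n)
    y = slope * x_true + rng.normal(0.0, sigma_y, size=n)

    # Observed predictor: true value plus additive measurement error.
    x_obs = x_true + rng.normal(0.0, sigma_u, size=n)

    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope_clean = np.polyfit(x_true, y, 1)[0]
    slope_noisy = np.polyfit(x_obs, y, 1)[0]

    # Under these assumptions the noisy-fit slope shrinks by the factor
    # var(x) / (var(x) + sigma_u^2), with var(x) = 1 here.
    slope_expected = slope * 1.0 / (1.0 + sigma_u**2)
    return slope_clean, slope_noisy, slope_expected


clean, noisy, expected = simulate_attenuation()
print(f"slope using true x:  {clean:.3f}")
print(f"slope using noisy x: {noisy:.3f} (theory: {expected:.3f})")
```

Ignoring the error in the predictor biases the estimated slope toward zero, which is the kind of distortion that measurement error models aim to correct.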

Paper Structure

This paper contains 20 sections, 13 equations, 9 figures, 1 table, and 2 algorithms.

Figures (9)

  • Figure 1: One-dimensional input data simulation results summarized over 100 simulated datasets.
  • Figure 2: Comparison of BART and meBART function estimation and 95% credible intervals for one simulated dataset from the indicator function setting. The true underlying function is plotted as a solid black line, and the data points with measurement error added are shown as blue dots. The dotted black lines and gray regions represent the point-wise mean and the point-wise 95% posterior credible interval, respectively.
  • Figure 3: The true $x_i$ variables, the observed noisy $x_i^*$, and the posterior estimates $\hat{x}_i$ of $x_i$. Arrows point from $x_i^*$ to $\hat{x}_i$ and represent our updated beliefs about the true value of $x_i$.
  • Figure 4: Multidimensional simulation setting results summarized over 100 independently generated datasets.
  • Figure 5: $\sigma$ trace plots for one simulated dataset from the indicator function setting. Both BART and meBART used 200 MCMC iterations for burn-in. The dashed black line represents the true underlying value of $\sigma=0.1$.
  • ...and 4 more figures