When Bayes goes bad: Weakly-regularized covariate adjustment leads to a biased estimate of prevalence

Swen Kuh, Lauren Kennedy, Qixuan Chen, Andrew Gelman

Abstract

When estimating population prevalence from a non-random sample, it is important to adjust for differences between the sample and the population. However, adjustment for multiple factors requires analysis that can be difficult to understand and validate. In this manuscript, we explore an unexpected downward trend in estimates as covariates are added sequentially to a Bayesian hierarchical model for estimating the prevalence of SARS-CoV-2 specific antibodies in an Australian city in late 2020. We compare our data analysis to results from a simulation study to understand four potential contributors to this effect: (i) correction for differences between sample and population, (ii) rare-events bias in logistic regression, (iii) inclusion of the uncertainty of test sensitivity and specificity in a multilevel model, and (iv) increasing model dimensionality. We find that weak prior distributions on the logistic regression coefficients lead to a systematic increase in the amount of partial pooling across adjustment cells (the prior becomes stronger as model dimensionality increases); this pooling feeds through to the estimated assay specificity, which in turn feeds back into the model and lowers the estimated prevalence. Our paper contributes three elements: (i) immediate and longer-term recommendations for using these types of models, (ii) simulation studies to explore the impact of the contributors to this effect, and (iii) a worked example of investigating unexpected results in a model with multiple adjustment factors.
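
For orientation, the class of model at issue couples a logistic regression for true seropositivity with the measurement properties of the assay, in the style of Gelman and Carpenter (2020). A plausible sketch of its core (the paper's exact specification may differ):

$$\Pr(y_i = 1) = \pi_i \gamma + (1 - \pi_i)(1 - \delta), \qquad \operatorname{logit}(\pi_i) = \beta_0 + X_i \beta,$$

where $\gamma$ is the assay sensitivity, $\delta$ the specificity (both given priors informed by validation data), and $\pi_i$ the probability that unit $i$ is truly seropositive. When prevalence is low, small shifts in the posterior for $\delta$ move the prevalence estimate substantially, which is the feedback channel described above.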

Paper Structure

This paper contains 21 sections, 27 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: MRP seroprevalence estimate ($y$-axis) for the Melbourne metropolitan residential population using models ($x$-axis) with sequentially added covariates as in Table \ref{tab:cov_table}. Color represents the original results: a prevalence estimate for the population (black) and a prevalence estimate for the sample (light green). The similarity between these two lines suggests a modeling issue rather than a sample-adjustment issue. Uncertainty bars represent 90% credible intervals. The light gray dotted line indicates the sample mean, while the colored dotted lines and annotations describe the three different challenges. We focus only on models 0 to 5 in this work.
  • Figure 2: Effect of sensitivity and specificity on the accuracy of recovering the true population prevalence $\pi$ from the observed test prevalence $p$. Orange tones indicate that the observed test prevalence is higher than the disease prevalence, while blue tones indicate that it is lower. Panels correspond to different levels of disease prevalence: when disease prevalence is low, specificity constrains the recovered value (the lower bound of the inequality); when prevalence is high, sensitivity bounds the error; in the middle panels the two trade off, showing the conditional behavior of the inequality. (A sketch of this mapping appears after this list.)
  • Figure 3: Distribution of bias ($y$-axis) between the posterior median estimate of prevalence ($\pi$) and the truth over 100 iterations, using varying sample sizes ($x$-axis) from intercept-only models, with true probability of outcome 0.001, 0.01, 0.1, and 0.2 in each panel. Box plots show the variance across simulation iterations, with orange representing the Bayesian models and green representing the models fit using the glm function in R. Negative values indicate that the estimate is smaller than the truth (rare-events bias), while positive values indicate that it is larger. The variance and outliers at smaller sample sizes and low prevalence are driven by estimates clumping based on sample properties, shown more clearly in Appendix \ref{ssec:alliter}. (A minimal version of the glm arm of this simulation appears after this list.)
  • Figure 4: Bias of estimates ($y$-axis) from models 0--5 with sequentially added covariates ($x$-axis) over 100 iterations when $\zeta_1 = 0.3$, with sample size $400$ (left panel) and $4000$ (right panel). Box plots show the variance across simulation iterations, with orange representing the predicted sample estimates and purple representing the estimated intercept terms. The predicted estimates are largely unbiased across models, while the estimated intercept shows increasing negative bias as more covariates are added; the pattern is less pronounced when the sample size is $4000$. (A back-of-the-envelope illustration of the intercept drift appears after this list.)
  • Figure 5: Estimates of the sample mean and the intercept parameter ($\beta_0$) from models 0--5 fit to the real data. The black line shows the predicted estimate from a model without the measurement-error and overall-effects terms, and the yellow line shows results from the full model with both. The left panel shows the prevalence estimate for the sample, while the right panel plots the intercept for the two models. The models without measurement error exhibit neither underestimation nor a downward slope as covariates are added, suggesting that the issue stems from the inclusion of either the measurement-error terms or the overall-effects terms.
  • ...and 15 more figures
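
The mapping in Figure 2 between the true prevalence $\pi$ and the observed test prevalence $p$ can be written down directly. A minimal sketch in R, with illustrative values rather than the paper's (the correction formula is the standard Rogan-Gladen estimator):

    # Observed test prevalence implied by a true prevalence pi,
    # for a test with the given sensitivity and specificity.
    observed_prevalence <- function(pi, sens, spec) {
      pi * sens + (1 - pi) * (1 - spec)
    }

    # Rogan-Gladen correction: invert the mapping, clamping to [0, 1]
    # because sampling noise in p can push the raw value outside it.
    rogan_gladen <- function(p, sens, spec) {
      pmin(pmax((p + spec - 1) / (sens + spec - 1), 0), 1)
    }

    observed_prevalence(pi = 0.005, sens = 0.90, spec = 0.99)  # ~0.0145
    rogan_gladen(p = 0.0145, sens = 0.90, spec = 0.99)         # ~0.005

At a true prevalence of 0.5%, roughly two thirds of the positives come from the 1% false-positive rate, which is why specificity constrains the recovered value in the low-prevalence panels.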
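
For the non-Bayesian arm of Figure 3, a minimal sketch of one cell of the simulation (assumed settings; the paper varies prevalence and sample size over a grid):

    # Error of the prevalence estimate from an intercept-only logistic
    # regression fit with glm(), over repeated samples.
    set.seed(1)
    true_p <- 0.01
    n <- 400
    errs <- replicate(100, {
      y <- rbinom(n, 1, true_p)
      fit <- glm(y ~ 1, family = binomial)
      plogis(coef(fit)[["(Intercept)"]]) - true_p
    })
    summary(errs)

For an intercept-only fit, plogis of the estimated intercept equals mean(y), so the estimates can only take the values $k/n$; samples with zero positives collapse to an estimate near zero, producing the clumping noted in the caption.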
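
The intercept drift in Figure 4 is consistent with the mechanism described in the abstract: independent weak priors on each coefficient imply an increasingly diffuse prior on the linear predictor as covariates are added, and on the probability scale a diffuse logit prior piles mass near 0 and 1. A back-of-the-envelope illustration in R (unit-scale priors assumed for concreteness; the paper's prior scales differ):

    # With K coefficients given independent normal(0, s) priors and binary
    # covariates equal to 1, the implied prior sd of the linear predictor
    # eta = beta_0 + beta_1 + ... + beta_K grows like s * sqrt(K + 1).
    set.seed(1)
    s <- 1
    for (K in 0:5) {
      eta <- rnorm(1e5, 0, s * sqrt(K + 1))
      cat(sprintf("K = %d: sd(eta) = %.2f, P(plogis(eta) < 0.01) = %.4f\n",
                  K, sd(eta), mean(plogis(eta) < 0.01)))
    }

As $K$ grows, the implied prior puts more and more mass on extreme cell probabilities; this is one concrete sense in which "the prior becomes stronger as model dimensionality increases."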