A proxy-based approach for unmeasured confounding in electronic health records research

Haley Colgate Kottler, Amy Cochran

Abstract

Electronic health records (EHR) are widely used to study clinical decisions, yet unmeasured confounding remains a persistent challenge. Proxy variables offer a potential solution. In EHR data, clinicians already record many such measurements (e.g., vitals), each revealing something about a patient's underlying health. Despite this, proxy-based methods are rarely used in practice. We introduce a new way to use proxies to adjust for unmeasured confounding. Our approach uses a vector of proxies to construct covariates that capture aspects of the unmeasured confounder, which are then included in a regression model. As one implementation, we use factor analysis followed by regression. We compare this approach with existing methods, including proximal causal inference, across a range of realistic settings. In practice, assumptions rarely hold exactly, so we study what happens when models are misspecified and variables are used incorrectly: e.g., a confounder or instrument is treated as a proxy. Finally, we apply the method to EHR data to estimate the effect of hospital admission for older adults presenting to the emergency department with chest pain, a setting where unmeasured confounding is a substantial concern. This work provides a practical way to use proxies and may help bring proxy-based methods into broader use.
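
The two-stage idea described above (summarize a vector of proxies into covariates, then include them in a regression) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the data-generating process, the coefficient values, and the function names `simulate`, `proxy_scores`, and `ols_treatment_coef` are all hypothetical, and the leading principal component of the standardized proxies is used as a simple stand-in for factor-analysis scores.

```python
import numpy as np


def simulate(n=5000, p=10, beta=1.0, seed=0):
    """Hypothetical linear model with one unmeasured confounder U and p proxies Z."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                       # unmeasured confounder
    loadings = rng.uniform(0.5, 1.5, size=p)     # how strongly each proxy reflects U
    z = np.outer(u, loadings) + 0.5 * rng.normal(size=(n, p))  # noisy proxies of U
    a = 0.8 * u + rng.normal(size=n)             # treatment depends on U
    y = beta * a + 1.5 * u + rng.normal(size=n)  # outcome depends on treatment and U
    return a, y, z


def proxy_scores(z, k=1):
    """Stage 1: summarize the proxies with the top-k principal component scores.

    This is a PCA-based stand-in for factor scores, chosen to keep the sketch
    dependency-free; it is an assumption, not the paper's estimator.
    """
    zs = (z - z.mean(axis=0)) / z.std(axis=0)
    u_mat, s, _ = np.linalg.svd(zs, full_matrices=False)
    return u_mat[:, :k] * s[:k]


def ols_treatment_coef(a, y, covariates=None):
    """Stage 2: ordinary least squares of Y on treatment (plus optional covariates)."""
    cols = [np.ones_like(a), a]
    if covariates is not None:
        cols.extend(covariates.T)
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # coefficient on the treatment column


a, y, z = simulate()
naive = ols_treatment_coef(a, y)                      # ignores the confounder
adjusted = ols_treatment_coef(a, y, proxy_scores(z))  # adjusts for proxy-derived score
print(f"true effect: 1.0, naive: {naive:.3f}, proxy-adjusted: {adjusted:.3f}")
```

In this toy setting the naive estimate absorbs the confounder's effect, while adding the proxy-derived score moves the estimate toward the true coefficient. Some residual bias remains because the score measures the confounder with error.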

Paper Structure

This paper contains 11 sections, 7 theorems, 39 equations, 6 figures, 1 table.

Key Result

Theorem 3.1

Assume Assumptions assm:identify--assm:express hold with vector-valued functions $\tau$ and $g$ defined therein. Then as long as $\tau(A,X)$ is the unique solution to: we can identify the CATE by way of

Figures (6)

  • Figure 1: DAGs illustrating unmeasured confounding and proxy variables: A) A proxy variable $Z$ influenced by an unmeasured confounder $U$ but independent of both treatment $A$ and outcome $Y$ given $U$. B) Proxy variables $V$ and $W$, respectively called a negative control exposure and negative control outcome, both influenced by an unmeasured confounder $U$ but still dependent on either the treatment $A$ or outcome $Y$ given $U$.
  • Figure 2: DAGs describing unmeasured confounding: A) with a general proxy variable, B) with alternative causal models that satisfy Corollary \ref{['corr:proof']}.
  • Figure 3: A) Estimation accuracy of our method and comparison methods as a function of sample size, reported as the median estimate and interquartile range (25th and 75th percentiles) across simulation replications. B) Estimation accuracy as a function of the ratio of proxies to latent confounders ($p/k$), reported as the median estimate across simulation replications. Our method is the most accurate for all sample sizes, rivaled only by proximal causal inference, which has a higher mean squared error. With $p/k\geq2$, our method is more accurate than the comparison methods, though at higher $p/k$ ratios proximal causal inference obtains similar accuracy.
  • Figure 4: Performance of methods under different violations of modeling and structural assumptions: A) the true outcome generation process is quadratic in $U$ but the estimator is linear, B) the latent variable is skew normal instead of normal, C) the treatment is binary but estimated as continuous, D) $Z$ is a direct confounder, not a proxy, E) $Z$ is an instrument, not a proxy. The centerline is the median, the hinges show the interquartile range (IQR), and the whiskers show the minimum of the highest estimate or $1.5 \times \text{IQR}$ from the upper hinge and the maximum of the lowest estimate or $1.5 \times \text{IQR}$ from the lower hinge.
  • Figure 5: A) Estimation accuracy of our method and comparison methods as a function of sample size, reported as the median estimate and interquartile range (25th and 75th percentiles) across simulation replications. B) Estimation accuracy as a function of the ratio of proxies to latent confounders ($p/k$), reported as the median estimate across simulation replications. Our method is the most accurate for all sample sizes, rivaled only by proximal causal inference, which has a higher mean squared error. With $p/k\geq2$, our method is more accurate than the comparison methods, though at higher $p/k$ ratios proximal causal inference obtains similar accuracy.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Theorem 3.1
  • Corollary 3.1
  • Lemma 3.1
  • Corollary 3.2
  • Lemma A.1
  • Lemma A.2
  • Corollary A.1