
Leveraging text data for causal inference using electronic health records

Reagan Mozer, Aaron R. Kaufman, Leo A. Celi, Luke Miratrix

TL;DR

This paper tackles the challenge of causal inference from electronic health records by integrating unstructured clinical text with standard analytical tools. It develops a unified framework that (i) uses text to augment missing-data imputation via multinomial inverse regression (MNIR) and multiple imputation by chained equations (MICE), (ii) strengthens causal identification through text-informed matching, and (iii) uncovers treatment-effect heterogeneity by conditioning on text-derived features. In an observational study of transthoracic echocardiography (TTE) in sepsis patients, the approach improves imputation accuracy, enhances covariate balance, and reveals substantial heterogeneity in treatment effects, identifying patient subgroups that may benefit most from treatment or be harmed by it. The work demonstrates that routinely collected text notes can meaningfully expand causal analysis in clinical research. Code and replication materials are provided to promote adoption, including in settings with limited structured data.

Abstract

In studies that rely on data from electronic health records (EHRs), unstructured text data such as clinical progress notes offer a rich source of information about patient characteristics and care that may be missing from structured data. Despite the prevalence of text in clinical research, these data are often ignored for the purposes of quantitative analysis due to their complexity. This paper presents a unified framework for leveraging text data to support causal inference with electronic health data at multiple stages of analysis. In particular, we consider how natural language processing and statistical text analysis can be combined with standard inferential techniques to address common challenges due to missing data, confounding bias, and treatment effect heterogeneity. Through an application to a recent EHR study investigating the effects of a non-randomized medical intervention on patient outcomes, we show how incorporating text data in a traditional matching analysis can help strengthen the validity of an estimated treatment effect and identify patient subgroups that may benefit most from treatment. We believe these methods have the potential to expand the scope of secondary analysis of clinical data to domains where structured EHR data is limited, such as in developing countries. To this end, we provide code and open-source replication materials to encourage adoption and broader exploration of these techniques in clinical research.
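
As a rough illustration of what "incorporating text data in a traditional matching analysis" can look like, the sketch below pairs each treated patient with the nearest control on a distance that combines structured covariates with binary term indicators taken from the notes. It is not the paper's procedure: the greedy 1:1 rule, the Euclidean distance, the bag-of-words representation, the `text_weight` parameter, and the record keys `id`, `x`, and `note` are all assumptions chosen to keep the example small.

```python
import math

def term_indicators(note, vocabulary):
    """Binary bag-of-words: 1.0 if the term appears in the note, else 0.0."""
    tokens = set(note.lower().split())
    return [1.0 if term in tokens else 0.0 for term in vocabulary]

def greedy_match(treated, controls, vocabulary, text_weight=1.0):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Each unit is a dict with hypothetical keys "id", "x" (structured
    covariates, assumed already standardized) and "note" (free text).
    """
    def features(unit):
        text = [text_weight * v for v in term_indicators(unit["note"], vocabulary)]
        return list(unit["x"]) + text

    available = {c["id"]: features(c) for c in controls}
    pairs = []
    for t in treated:
        if not available:
            break  # no controls left to match
        ft = features(t)
        best = min(available, key=lambda cid: math.dist(ft, available[cid]))
        pairs.append((t["id"], best))
        del available[best]  # matching without replacement
    return pairs

# Toy, made-up records for illustration only.
vocab = ["intubated", "sedated"]
treated = [{"id": "t1", "x": [0.2], "note": "Patient intubated and sedated"}]
controls = [
    {"id": "c1", "x": [0.2], "note": "Patient alert and oriented"},
    {"id": "c2", "x": [0.3], "note": "intubated sedated overnight"},
]
with_text = greedy_match(treated, controls, vocab)
without_text = greedy_match(treated, controls, vocab, text_weight=0.0)
```

In this toy data, setting `text_weight=0.0` reduces the procedure to matching on the structured covariate alone and selects `c1`, whereas including the text terms selects `c2`, whose note describes a clinically similar patient.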

Paper Structure

This paper contains 16 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Analytical workflow for leveraging unstructured text data to support causal inference with EHR data.
  • Figure 2: Performance of the multiple imputation models for each covariate when conditioning on both the observed structured covariates and text-based covariates (red) compared to the structured covariates alone (blue).
  • Figure 3: Standardized differences in means between treatment and control groups for 15 structured baseline covariates (top) and 16 text-based covariates (bottom) before matching (black), after propensity score matching (red), and after text matching (blue). Point estimates and 95% confidence intervals are aggregated across 5 multiply imputed versions of the structured covariates. Text matching maintains the original balance on structured covariates and improves balance on text covariates.
  • Figure 4: Interaction effects for thirteen variables. Variables in blue are determined by optimal cutoffs of structured covariates; variables in red are determined by the presence or absence of given text features. Text features identify groupings whose differences in impact are comparable to or larger than those identified by structured covariates.
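
The balance metric reported in Figure 3, a standardized difference in means, can be computed per covariate as sketched below. The pooled-standard-deviation denominator and the simple averaging of point estimates across imputed datasets are common conventions assumed here; the paper's exact formulas are not given in this summary, and the function names are hypothetical.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """(mean_t - mean_c) / sqrt((var_t + var_c) / 2), using sample variances."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

def pool_point_estimates(estimates):
    """Average one estimate per imputed dataset into a single point estimate."""
    return mean(estimates)

# Toy indicator for a text feature (1 = term present in the note), made up
# for illustration: imbalanced before matching, balanced after.
smd_before = standardized_mean_difference([1, 1, 0, 1], [0, 0, 1, 0])
smd_after = standardized_mean_difference([1, 1, 0, 1], [1, 0, 1, 1])
```

Values near zero indicate that the treatment and control groups are similar on that covariate. The same function applies to structured covariates and to text-derived indicators, which is what allows the two panels of Figure 3 to be read on a common scale.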