Proximal Causal Inference With Text Data

Jacob M. Chen; Rohit Bhattacharya; Katherine A. Keith

TL;DR

This work proposes a new causal inference method that uses two instances of pre-treatment text data, infers two proxies using two zero-shot models on the separate instances, and applies these proxies in the proximal g-formula.

Abstract

Recent text-based causal methods attempt to mitigate confounding bias by estimating proxies of confounding variables that are partially or imperfectly measured from unstructured text data. These approaches, however, assume analysts have supervised labels of the confounders given text for a subset of instances, a constraint that is sometimes infeasible due to data privacy or annotation costs. In this work, we address settings in which an important confounding variable is completely unobserved. We propose a new causal inference method that uses two instances of pre-treatment text data, infers two proxies using two zero-shot models on the separate instances, and applies these proxies in the proximal g-formula. We prove, under certain assumptions about the instances of text and accuracy of the zero-shot predictions, that our method of inferring text-based proxies satisfies identification conditions of the proximal g-formula while other seemingly reasonable proposals do not. To address untestable assumptions associated with our method and the proximal g-formula, we further propose an odds ratio falsification heuristic that flags when to proceed with downstream effect estimation using the inferred proxies. We evaluate our method in synthetic and semi-synthetic settings -- the latter with real-world clinical notes from MIMIC-III and open large language models for zero-shot prediction -- and find that our method produces estimates with low bias. We believe that this text-based design of proxies allows for the use of proximal causal inference in a wider range of scenarios, particularly those for which obtaining suitable proxies from structured data is difficult.
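The estimation step described above can be illustrated on simulated data. The following is a minimal sketch, not the authors' implementation: it assumes a binary unmeasured confounder `U`, replaces the two zero-shot predictions with noisy copies of `U` (`W` and `Z`), omits the baseline covariates and the text itself, and swaps in a simple linear two-stage least squares variant of the proximal g-formula. All names, error rates, and coefficients are hypothetical. The function `odds_ratio` only computes the raw association between the two proxies; the decision rule of the paper's heuristic is not reproduced here.

```python
import numpy as np

def odds_ratio(w, z):
    """Sample odds ratio between two binary arrays."""
    n11 = float(np.sum((w == 1) & (z == 1)))
    n00 = float(np.sum((w == 0) & (z == 0)))
    n10 = float(np.sum((w == 1) & (z == 0)))
    n01 = float(np.sum((w == 0) & (z == 1)))
    return (n11 * n00) / (n10 * n01)

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def proximal_2sls(y, a, w, z):
    """Two-stage least squares: regress W on (A, Z), then Y on (A, W_hat)."""
    ones = np.ones(len(y))
    # Saturated first stage, so E[W | A, Z] is correctly specified for binary A, Z.
    X1 = np.column_stack([ones, a, z, a * z])
    w_hat = X1 @ ols(X1, w.astype(float))
    X2 = np.column_stack([ones, a, w_hat])
    return ols(X2, y)[1]  # coefficient on the treatment A

# Hypothetical data-generating process (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 200_000
u = rng.binomial(1, 0.5, n)                    # unmeasured confounder
a = rng.binomial(1, 0.2 + 0.6 * u)             # treatment depends on U
y = 1.0 * a + 2.0 * u + rng.normal(0, 1, n)    # true ACE is 1.0
# Stand-ins for two zero-shot predictions with independent 20% error rates.
w = np.where(rng.binomial(1, 0.2, n) == 1, 1 - u, u)
z = np.where(rng.binomial(1, 0.2, n) == 1, 1 - u, u)

or_wz = odds_ratio(w, z)
ace_hat = proximal_2sls(y, a, w, z)
print(f"odds ratio(W, Z) = {or_wz:.2f}, proximal estimate = {ace_hat:.2f}")
```

In this simulation the two proxies err independently given `U`, which is what selecting two distinct instances of text is intended to encourage. The linear second stage is valid here only because the simulated outcome model is linear in `A` and `U`.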

Paper Structure (33 sections, 4 theorems, 7 equations, 9 figures, 21 tables, 1 algorithm)

Introduction
Problem Setup And Motivation
Primary criticism
Our approach
Designing Text-Based Proxies
Gotcha #1: Using predictions directly in backdoor adjustment.
Gotcha #2: Using post-treatment text.
Gotcha #3: Predicting both proxies from the same instance of text.
Gotcha #4: Using a single zero-shot model.
Our Final Design Procedure
Falsification: Odds Ratio Heuristic
Empirical Experiments and Results
RQs
Fully Synthetic Experiments
Semi-Synthetic Experiments
...and 18 more sections

Key Result

Proposition 1

Using a proxy $W$ in the backdoor adjustment formula results in biased estimates of the average causal effect (ACE) in general.
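A small simulation makes this proposition concrete. The sketch below is illustrative only and is not taken from the paper: it assumes a binary unmeasured confounder `U`, a proxy `W` that equals `U` with a 20% chance of being flipped, and a linear outcome with a true effect of 1.0. All of these values are hypothetical. It compares backdoor adjustment on the true `U` with backdoor adjustment on the proxy `W`.

```python
import numpy as np

def backdoor_ace(y, a, x):
    """Plug-in backdoor adjustment over a single binary covariate x."""
    ace = 0.0
    for v in (0, 1):
        stratum = x == v
        # Treated-minus-control mean outcome within the stratum x == v.
        diff = y[stratum & (a == 1)].mean() - y[stratum & (a == 0)].mean()
        ace += diff * stratum.mean()  # weight by the stratum's sample share
    return ace

# Hypothetical data-generating process (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 200_000
u = rng.binomial(1, 0.5, n)                    # unmeasured confounder
a = rng.binomial(1, 0.2 + 0.6 * u)             # treatment depends on U
y = 1.0 * a + 2.0 * u + rng.normal(0, 1, n)    # true ACE is 1.0
w = np.where(rng.binomial(1, 0.2, n) == 1, 1 - u, u)  # noisy proxy of U

ace_oracle = backdoor_ace(y, a, u)  # adjusts for the true confounder
ace_proxy = backdoor_ace(y, a, w)   # adjusts for the proxy only
print(f"adjusting for U: {ace_oracle:.2f}, adjusting for W: {ace_proxy:.2f}")
```

Adjusting for `W` removes only part of the confounding because, within each stratum of `W`, the treated and untreated groups still differ in their distribution of `U`.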

Figures (9)

Figure 1: Pipeline for proximal causal inference with text data. The top row of captions describes the general pipeline that uses text data from any setting, and the bottom italicized row describes an illustrative example based on our semi-synthetic experiments in Sec. \ref{sec:simulations_and_empirical_procedure}. (1) We filter to only pre-treatment text; (2 and 3) for each individual in the analysis, we select two distinct instances of text (e.g., echocardiogram and nursing notes) via metadata with the goal of satisfying ${\bf T}^\text{pre}_1 \perp\!\!\!\perp {\bf T}^\text{pre}_2 \mid U, {\bf C}$; (4 and 5) we use ${\bf T}^\text{pre}_1$ and ${\bf T}^\text{pre}_2$ as inputs into LLM-1 and LLM-2, respectively, to infer zero-shot proxies $Z$ and $W$. (7) If the proxies fail our odds ratio heuristic, analysis stops. (8, 9, and 10) Otherwise, we use the proximal g-formula implied by the causal DAG to estimate the causal effect.
Figure 2: Causal DAGs (a) depicting unmeasured confounding and (b) compatible with the canonical assumptions used for proximal causal inference \cite{tchetgen2020introduction}.
Figure 3: Causal DAGs depicting several different scenarios for inferring text-based proxies. Edges with different colors and patterns from the text ${\bf T}$ to the proxies $Z$ and $W$ indicate that different zero-shot models were used. Our final recommended method is based on (d).
Figure 4: Semi-synthetic results for ACE point estimates (dots) and 95% CIs (bars). We distinguish settings that passed the odds ratio heuristic ($\checkmark$) from those that failed, with $\gamma_{\text{high}} = 2$.
Figure 5: Using both pre-treatment and post-treatment text to generate valid proxies.
...and 4 more figures

Theorems & Definitions (8)

Proposition 1
proof
Proposition 2
proof
Proposition 3
proof
Proposition 4
proof
