From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives

Shuxian Fan; Adam Visokay; Kentaro Hoffman; Stephen Salerno; Li Liu; Jeffrey T. Leek; Tyler H. McCormick

TL;DR

The paper addresses valid inference when verbal autopsy (VA) causes of death (CODs) are predicted from free-text narratives. It extends prediction-powered inference to multinomial outcomes with multiPPI++, formulating a rectified loss $L_{\lambda}^{\text{PPI++}}(\theta)=L_n(\theta)+\lambda\left(L_N^{f_u}(\theta)-L_n^{f_l}(\theta)\right)$ that combines ground-truth labels with NLP predictions. Using Population Health Metrics Research Consortium (PHMRC) data and a leave-one-site-out design, it shows that multiPPI++ can recover ground-truth effects and quantify uncertainty even with imperfect COD predictions, while highlighting that better predictions do not always yield better inference. The work improves public health inference from VA data and informs how to allocate labeling effort and handle cross-site transportability in practice.
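As a rough illustration of the rectified loss above, the sketch below evaluates it for a multinomial logistic model in NumPy. It rests on several assumptions not stated in this summary: that the base loss is the mean softmax negative log-likelihood, that $L_N^{f_u}$ is that loss computed on predicted labels for the $N$ unlabeled cases, and that $L_n^{f_l}$ is the same loss computed on predicted labels for the $n$ labeled cases. The function and argument names (`softmax_nll`, `rectified_loss`, `f_lab`, `f_unlab`) are hypothetical, and the sketch only evaluates the loss; it does not choose $\lambda$ or compute standard errors.

```python
import numpy as np

def softmax_nll(theta, X, y):
    """Mean multinomial-logistic negative log-likelihood (assumed base loss).

    theta: (d, K) coefficients; X: (n, d) covariates; y: (n,) integer labels in 0..K-1.
    """
    logits = X @ theta
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def rectified_loss(theta, X_lab, y_lab, f_lab, X_unlab, f_unlab, lam):
    """L_n(theta) + lam * (L_N^{f_u}(theta) - L_n^{f_l}(theta)).

    y_lab: ground-truth labels; f_lab / f_unlab: predicted labels (hypothetical names).
    """
    L_n = softmax_nll(theta, X_lab, y_lab)          # labeled data, true labels
    L_N_fu = softmax_nll(theta, X_unlab, f_unlab)   # unlabeled data, predicted labels
    L_n_fl = softmax_nll(theta, X_lab, f_lab)       # labeled data, predicted labels
    return L_n + lam * (L_N_fu - L_n_fl)
```

Setting `lam=0` reduces this to the labeled-only loss, which is one way to see that the second term acts purely as a prediction-based adjustment.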

Abstract

In settings where most deaths occur outside the healthcare system, verbal autopsies (VAs) are a common tool to monitor trends in causes of death (COD). VAs are interviews with a surviving caregiver or relative that are used to predict the decedent's COD. Turning VAs into actionable insights for researchers and policymakers requires two steps: (i) predicting likely COD using the VA interview and (ii) performing inference with predicted CODs (e.g., modeling the breakdown of causes by demographic factors using a sample of deaths). In this paper, we develop a method for valid inference using outcomes (in our case COD) predicted from free-form text using state-of-the-art NLP techniques. This method, which we call multiPPI++, extends recent work in "prediction-powered inference" to multinomial classification. We leverage a suite of NLP techniques for COD prediction and, through empirical analysis of VA data, demonstrate the effectiveness of our approach in handling transportability issues. multiPPI++ recovers ground truth estimates, regardless of which NLP model produced predictions and regardless of whether they were produced by a more accurate predictor like GPT-4-32k or a less accurate predictor like KNN. Our findings demonstrate the practical importance of inference correction for public health decision-making and suggest that if inference tasks are the end goal, having a small amount of contextually relevant, high-quality labeled data is essential regardless of the NLP algorithm.

Paper Structure (19 sections, 18 equations, 54 figures, 1 table, 1 algorithm)

Introduction
Population Health Metrics Research Consortium Narratives
Proposed Analytic Workflow
NLP for VA Narratives
Valid Statistical Inference with multiPPI++ Correction
Inference with COD Predicted from VA Narratives
Experimental Setup
NLP Prediction Results
Inferential Model Results
Discussion
Appendix
ICD-10 COD Classification
PPI and PPI++: overview
multiPPI++
Parameter Estimates Across All Sites: Full Data 80/20 Split
...and 4 more sections

Figures (54)

Figure 1: Overview of multiPPI++ correction. Ground truth labels and predicted labels are used separately to perform the same inference task in Domain A. We use the difference between these estimates as a correction factor in Domain B where ground truth labels are not available.
Figure 2: While non-communicable disease is the most common COD in each site, relative prevalence of each COD varies considerably.
Figure 3: The zero-shot prompt explicitly tags the VA narrative, provides minimal context for each COD label, lists an explicit set of COD options to coerce a constrained output, and provides direct instructions pointing to the <narrative><label><option> tags.
Figure 4: The classic and BERT models are trained with VA narratives from five other sites and used to predict COD from Uttar Pradesh narratives. The zero-shot GPT-4 model predicts COD without any site-specific narrative fine-tuning. Ground truth, Naive, and multiPPI++-corrected inference are performed using these Uttar Pradesh predictions.
Figure 5: For VA narratives from Uttar Pradesh, most of the COD misclassifications are assigned the non-communicable COD label. Naive Bayes mostly predicts non-communicable and achieves 0.60 accuracy in part because non-communicable is overwhelmingly the most common ground truth COD.
...and 49 more figures
