Causal Inference on Outcomes Learned from Text

Iman Modarressi, Jann Spiess, Amar Venugopal

TL;DR

This work addresses how to draw causal inferences from unstructured text in randomized trials by combining large language models with a rigorous econometric framework. It introduces a three-stage approach: (i) test whether treatment changes text distributions via a hold-out permutation test that predicts group labels from text, (ii) describe the differences with low-dimensional causal themes learned by an LLM and validated by humans, and (iii) assess the completeness of the description against a model-free benchmark. The framework handles the challenges of interpretability and inference from complex text by using sample splitting, human scoring, and a bias-corrected estimator that blends cheap machine scores with costly human labels. A proof-of-concept on arXiv abstracts using Google Gemini demonstrates detectable group differences, interpretable themes, and high completeness, illustrating practical viability and highlighting the importance of pre-specification, replication, and human–AI complementarity in text-based causal analysis.
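The idea of blending cheap machine scores with costly human labels can be illustrated with a generic difference-style correction: average the machine score over all documents in a group, then add the average human-minus-machine gap measured on the labeled subset. This is a minimal sketch under that assumption and is not necessarily the estimator the authors use. The function names, the simulated data, and the size of the machine bias are all hypothetical.

```python
import numpy as np

def combined_mean(machine_all, machine_labeled, human_labeled):
    """Machine-score mean on all documents plus the average
    human-minus-machine gap on the human-labeled subset."""
    return np.mean(machine_all) + np.mean(human_labeled - machine_labeled)

def combined_effect(machine, human, labeled, w):
    """Difference in bias-corrected theme means, treated (w=1) minus control (w=0).
    `human` is only read where `labeled` is True."""
    est = {}
    for g in (0, 1):
        in_g = w == g
        lab = in_g & labeled
        est[g] = combined_mean(machine[in_g], machine[lab], human[lab])
    return est[1] - est[0]

# Hypothetical data: binary theme scores, alternating treatment assignment.
rng = np.random.default_rng(0)
n = 200
w = np.tile([0, 1], n // 2)
human = rng.binomial(1, 0.3 + 0.2 * w).astype(float)
machine = human + 0.3 * w            # assumed: machine over-scores treated documents
labeled = np.zeros(n, dtype=bool)
labeled[:40] = True                  # only 40 documents receive human labels

target = human[w == 1].mean() - human[w == 0].mean()
naive = machine[w == 1].mean() - machine[w == 0].mean()
effect = combined_effect(machine, human, labeled, w)
print(f"target={target:.3f} naive={naive:.3f} combined={effect:.3f}")
```

In this toy setup the machine error is constant within each group, so the correction removes it exactly. With noisy machine scores the correction would remove the bias only on average, and its variance would depend on how many documents are human-labeled.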

Abstract

We propose a machine-learning tool that yields causal inference on text in randomized trials. Based on a simple econometric framework in which text may capture outcomes of interest, our procedure addresses three questions: First, is the text affected by the treatment? Second, which outcomes is the effect on? And third, how complete is our description of causal effects? To answer all three questions, our approach uses large language models (LLMs) that suggest systematic differences across two groups of text documents and then provides valid inference based on costly validation. Specifically, we highlight the need for sample splitting to allow for statistical validation of LLM outputs, as well as the need for human labeling to validate substantive claims about how documents differ across groups. We illustrate the tool in a proof-of-concept application using abstracts of academic manuscripts.

Paper Structure

This paper contains 33 sections, 4 theorems, 25 equations, 2 figures, 3 tables.

Key Result

Proposition 1

If $(\widehat{W}_i)_{i \in \mathcal{H}} \perp (W_i)_{i \in \mathcal{H}} \: | \: (Y_i,W_i)_{i \in \mathcal{T}},(Y_i)_{i \in \mathcal{H}}$, then the test based on the permutation $p$-value $\widehat{p}$ provides (conditionally) valid size control in the sense that $\mathop{\mathrm{P}}\nolimits(\wideha
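A hold-out permutation test of this kind can be sketched as follows. The classifier is fit on the training split only, so its predictions on the hold-out texts do not depend on the hold-out labels, which is the independence condition the proposition relies on. This is a minimal sketch under stated assumptions: numeric features stand in for text, a nearest-centroid rule stands in for the LLM-based predictor, and prediction accuracy is the assumed test statistic. All names and the simulated data are hypothetical.

```python
import numpy as np

def fit_centroid_classifier(X_train, w_train):
    """Fit a nearest-centroid rule on the training split only."""
    mu0 = X_train[w_train == 0].mean(axis=0)
    mu1 = X_train[w_train == 1].mean(axis=0)

    def predict(X):
        d0 = np.linalg.norm(X - mu0, axis=1)
        d1 = np.linalg.norm(X - mu1, axis=1)
        return (d1 < d0).astype(int)

    return predict

def holdout_permutation_pvalue(w_hat, w_holdout, n_perm=999, seed=0):
    """Permutation p-value for hold-out accuracy.
    Predictions stay fixed; only the hold-out labels are shuffled."""
    rng = np.random.default_rng(seed)
    observed = np.mean(w_hat == w_holdout)
    exceed = 0
    for _ in range(n_perm):
        permuted = rng.permutation(w_holdout)
        exceed += int(np.mean(w_hat == permuted) >= observed)
    return (1 + exceed) / (1 + n_perm)

# Hypothetical data: 5 features per document, shifted for the treated group.
rng = np.random.default_rng(1)
n, d = 400, 5
w = rng.permutation(np.repeat([0, 1], n // 2))
X = rng.normal(size=(n, d)) + 2.0 * w[:, None]
train = np.arange(n) < n // 2
hold = ~train

predict = fit_centroid_classifier(X[train], w[train])
p_value = holdout_permutation_pvalue(predict(X[hold]), w[hold])
print(p_value)
```

Adding one to both the numerator and the denominator counts the observed labeling as one of the permutations, which keeps the p-value strictly positive.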

Figures (2)

  • Figure 1: Description of group differences provided by the LLM based on the training sample.
  • Figure 2: Tradeoff between the number of human-labeled hold-out datapoints ($\ell$) and the precision of the simple average treatment effect estimator for each theme.
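The tradeoff in Figure 2 can be illustrated with the textbook standard error of a difference in means, which shrinks at rate $1/\sqrt{\ell}$. This is a minimal sketch, assuming the simple estimator is a difference in mean human scores between treated and control hold-out documents. The standard deviations, the treated share, and the function name are hypothetical and are not values from the paper.

```python
import math

def diff_in_means_se(sd_treated, sd_control, n_labeled, share_treated=0.5):
    """Standard error of a difference in means when n_labeled human-labeled
    documents are split between treated and control groups."""
    n1 = n_labeled * share_treated
    n0 = n_labeled - n1
    return math.sqrt(sd_treated**2 / n1 + sd_control**2 / n0)

# Hypothetical standard deviations of 0.5 in both groups.
for ell in (50, 200, 800):
    print(ell, diff_in_means_se(0.5, 0.5, ell))
```

Under these assumptions, quadrupling the number of human-labeled documents halves the standard error, so precision gains become progressively more expensive in labeling cost.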

Theorems & Definitions (8)

  • Proposition 1: Valid permutation-based test
  • Proposition 2: Valid inference on themes
  • Proposition 3: Valid inference with combined scores
  • Definition 1: Completeness of descriptions
  • Proof: Proof of Proposition 1
  • Lemma 1: A conditional Berry–Esseen inequality
  • Proof: Proof of Proposition 2
  • Proof: Proof of Proposition 3