Generating Synthetic Text Data to Evaluate Causal Inference Methods
Zach Wood-Doughty, Ilya Shpitser, Mark Dredze
TL;DR
This work introduces a controllable framework of synthetic data-generating processes (DGPs) for evaluating causal inference methods on text data, addressing the challenge that the causal mechanisms behind real language are unknown. By implementing two DGPs (LDA-based and GPT-2–based) and four estimators (propensity matching, representation matching, inverse probability weighting (IPW), and measurement error), the authors dissect how text generation processes interact with confounding and estimator assumptions. Key findings show that simple text matching and IPW can fail as the text-generating model becomes more realistic, while the measurement-error approach remains robust only when the classifier $p(U|T)$ is accurate and labeled data are available. The study emphasizes the practical value of synthetic text for diagnosing the assumptions of causal methods and points to directions for more robust, scalable estimators and more realistic DGPs in NLP settings.
Abstract
Drawing causal conclusions from observational data requires making assumptions about the true data-generating process. Causal inference research typically considers low-dimensional data, such as categorical or numerical fields in structured medical records. High-dimensional and unstructured data such as natural language complicate the evaluation of causal inference methods; such evaluations rely on synthetic datasets with known causal effects. Models for natural language generation have been widely studied and perform well empirically. However, existing methods are not immediately applicable to producing synthetic datasets for causal evaluations, as they do not allow for quantifying a causal effect on the text itself. In this work, we develop a framework for adapting existing generation models to produce synthetic text datasets with known causal effects. We use this framework to perform an empirical comparison of four recently proposed methods for estimating causal effects from text data. We release our code and synthetic datasets.
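
As a rough illustration of what "synthetic text with a known causal effect" can mean, the sketch below simulates a toy data-generating process and checks an estimator against the ground truth. It is not the authors' released code, and it uses neither of the paper's LDA-based or GPT-2-based generators. Everything in it is an assumption made for illustration: the binary confounder `u`, the treatment `a`, the outcome `y`, the six-word vocabulary, all probabilities, and the majority-vote rule `guess_confounder` that stands in for a text classifier.

```python
import random

TRUE_ATE = 0.3  # causal effect of treatment on outcome, fixed by construction

# Hypothetical vocabularies: tokens are drawn mostly from the list matching u.
VOCAB = {0: ["walk", "tea", "garden"], 1: ["smoke", "stress", "night"]}


def simulate(n, seed=0):
    """Draw n rows (u, a, y, text) from a toy confounded process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = int(rng.random() < 0.5)                      # hidden confounder
        a = int(rng.random() < (0.8 if u else 0.2))      # treatment depends on u
        y = int(rng.random() < 0.2 + TRUE_ATE * a + 0.4 * u)  # outcome
        # Each of 15 tokens comes from u's vocabulary with probability 0.8.
        text = [rng.choice(VOCAB[u if rng.random() < 0.8 else 1 - u])
                for _ in range(15)]
        rows.append((u, a, y, text))
    return rows


def naive_estimate(rows):
    """Difference in mean outcome between treated and untreated rows."""
    treated = [y for _, a, y, _ in rows if a == 1]
    control = [y for _, a, y, _ in rows if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def guess_confounder(text):
    """Majority vote over tokens; a stand-in for a learned classifier."""
    hits = sum(tok in VOCAB[1] for tok in text)
    return int(hits > len(text) / 2)


def ipw_estimate(rows):
    """Inverse probability weighting with propensities estimated from text."""
    guesses = [guess_confounder(text) for _, _, _, text in rows]
    propensity = {}
    for g in (0, 1):
        treat = [a for (_, a, _, _), gi in zip(rows, guesses) if gi == g]
        propensity[g] = sum(treat) / len(treat)  # empirical P(a=1 | guess)
    total = 0.0
    for (_, a, y, _), g in zip(rows, guesses):
        e = propensity[g]
        total += a * y / e - (1 - a) * y / (1 - e)
    return total / len(rows)


if __name__ == "__main__":
    data = simulate(20000, seed=1)
    print("true effect:   ", TRUE_ATE)
    print("naive estimate:", round(naive_estimate(data), 3))
    print("IPW estimate:  ", round(ipw_estimate(data), 3))
```

Because the effect is set by hand, any estimator's error can be measured directly. In this toy setting the text reveals the confounder almost perfectly, so IPW should land close to `TRUE_ATE` while the naive difference is biased upward; making the text less informative about `u` is one way to probe when text-based adjustment breaks down.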
