Generating Synthetic Text Data to Evaluate Causal Inference Methods

Zach Wood-Doughty, Ilya Shpitser, Mark Dredze

TL;DR

This work introduces a controllable synthetic DGP framework to evaluate causal inference methods using text data, addressing the challenge of unknown causal mechanisms in language. By implementing two DGPs (LDA-based and GPT-2–based) and four estimators (Propensity/Representation Matching, IPW, and Measurement Error), the authors dissect how text generation processes interact with confounding and estimator assumptions. Key findings show that simple text-matching and IPW can fail as text-generating models become more realistic, while the measurement-error approach remains robust only when a classifier $p(U|T)$ is accurate and labeled data are available. The study emphasizes the practical value of synthetic text for diagnosing causal method assumptions and points to directions for more robust, scalable estimators and realistic DGPs in NLP settings.

Abstract

Drawing causal conclusions from observational data requires making assumptions about the true data-generating process. Causal inference research typically considers low-dimensional data, such as categorical or numerical fields in structured medical records. High-dimensional and unstructured data such as natural language complicates the evaluation of causal inference methods; such evaluations rely on synthetic datasets with known causal effects. Models for natural language generation have been widely studied and perform well empirically. However, existing methods are not immediately applicable to producing synthetic datasets for causal evaluations, as they do not allow for quantifying a causal effect on the text itself. In this work, we develop a framework for adapting existing generation models to produce synthetic text datasets with known causal effects. We use this framework to perform an empirical comparison of four recently proposed methods for estimating causal effects from text data. We release our code and synthetic datasets.
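To make the phrase "synthetic datasets with known causal effects" concrete, the following is a minimal sketch, not the authors' framework. It omits the text $T$ entirely and simulates only a binary confounder $U$, a treatment $A$, and an outcome $Y$. All function names, probabilities, and coefficients are assumptions chosen for illustration. Because the data-generating process is written by hand, the true average causal effect (1.0 here) is known, so an estimator's error can be measured directly.

```python
import numpy as np


def simulate(n=200_000, seed=0):
    """Draw (U, A, Y) from a hand-written DGP whose true effect of A on Y is 1.0.

    All numeric choices below are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    u = rng.binomial(1, 0.5, size=n)        # binary confounder U
    p_a = np.where(u == 1, 0.8, 0.2)        # p(A=1 | U): U strongly affects treatment
    a = rng.binomial(1, p_a)                # treatment A
    # Outcome depends on both A (effect 1.0) and U (effect 2.0), plus noise.
    y = 1.0 * a + 2.0 * u + rng.normal(0.0, 1.0, size=n)
    return u, a, y, p_a


def naive_estimate(a, y):
    """Difference in mean outcomes, ignoring confounding by U."""
    return y[a == 1].mean() - y[a == 0].mean()


def ipw_estimate(a, y, propensity):
    """Normalized (Hajek) inverse probability weighting estimate of the effect."""
    w1 = a / propensity                     # weights for treated units
    w0 = (1 - a) / (1 - propensity)         # weights for untreated units
    return (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()


if __name__ == "__main__":
    u, a, y, p_a = simulate()
    print("true effect:    1.0")
    print("naive estimate:", round(naive_estimate(a, y), 3))
    print("IPW estimate:  ", round(ipw_estimate(a, y, p_a), 3))
```

In this sketch the naive difference in means is biased upward because $U$ raises both the chance of treatment and the outcome, while weighting by the true propensity $p(A=1 \mid U)$ recovers a value near 1.0. In the setting the abstract describes, $U$ is not observed directly and must be inferred from text, which is where the evaluated estimators differ.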

Paper Structure

This paper contains 25 sections, 3 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The causal DAG we consider. $A$ is our treatment, $Y$ is our outcome, $C$ and $U$ are confounders, and $T$ is the raw text which is influenced by $U$. The counterfactual $p(Y(a))$ cannot be non-parametrically identified from $p(C, A, Y)$ alone due to unobserved confounding from $U$. Methods may make parametric assumptions on the relationship between $T$ and $U$ in order to estimate the causal effect, or assume knowledge of $p(U \mid T)$. We parameterize $p(T \mid U)$ with text generation models in § \ref{sec:synthetic_dgps}. We discuss the limitations of this DAG model and extensions to other models in § \ref{subsec:other_dags}.
  • Figure 2: Causal effect strengths and Trivial text generation. Blue and red bars correspond to $U=0$ and $U=1$ respectively. As $\tau$ increases, the ranked preferences between $U=0$ and $U=1$ diverge. As $\delta$ increases, the distribution puts more weight on the ranked preferences. The x-axis indexes the 16 words in the vocabulary, with each bar indicating the probability that a word shows up at least once in a 16-word sequence. When $\tau=0.1$ and $\delta=0.1$, the distributions are close to uniform and almost entirely overlap. In all plots the $\tilde{V}_0$ order matches the x-axis order. As $\tau$ increases, the $\tilde{V}_1$ order diverges. As $\delta$ increases, both distributions become more concentrated on higher-ranked words.
  • Figure 3: DistilGPT-2 generation when we fix the random seed, template, and $\tilde{V}_\text{word}$ but vary $\delta_\text{word}$. We construct $\tilde{V}_\text{word}$ so the most-preferred words are her, magic, and ability. The model switches from his to her pronouns as $\delta$ increases. As $\delta$ further increases, sentence fluency decreases.
  • Figure 4: Joint and marginal density plots of text classifier accuracy and mean absolute causal estimation error for each DGP and each estimation method that relies on a text classifier. Each dot represents one experiment. Figure 5 shows a zoomed-out plot for LDA+IPW; all other plots contain all data. Colors indicate the four structured variable random seeds used to create the true data-generating distributions. For the IPW and Prop methods, the visible clusters show that the relationship between classifier accuracy and causal error is highly dependent on the random seed for structured variables. Thus, for a real-world analysis with an unknown DGP, better classifier accuracy does not imply lower causal error. For the ME method, classifier accuracy and causal error are not clustered by the underlying DGP.
  • Figure 5: Zoomed-out version of Figure 4 for the IPW estimator on LDA data. For one random seed for structured variables (the blue cluster), causal error is quite large.
  • ...and 2 more figures