Causal Effect Estimation with Latent Textual Treatments

Omri Feldman; Amar Venugopal; Jann Spiess; Amir Feder

TL;DR

An end-to-end pipeline for the generation and causal estimation of latent textual interventions, which first performs hypothesis generation and steering via sparse autoencoders (SAEs), followed by robust causal estimation.

Abstract

Understanding the causal effects of text on downstream outcomes is a central task in many applications. Estimating such effects requires researchers to run controlled experiments that systematically vary textual features. While large language models (LLMs) hold promise for generating text, producing and evaluating controlled variation requires more careful attention. In this paper, we present an end-to-end pipeline for the generation and causal estimation of latent textual interventions. Our pipeline first performs hypothesis generation and steering via sparse autoencoders (SAEs), followed by robust causal estimation, and addresses both computational and statistical challenges in text-as-treatment experiments. We demonstrate that naive estimation of causal effects suffers from significant bias, because text inherently conflates treatment and covariate information. We describe the estimation bias induced in this setting and propose a solution based on covariate residualization. Our empirical results show that our pipeline effectively induces variation in target features and mitigates estimation error, providing a robust foundation for causal effect estimation in text-as-treatment settings.

Paper Structure (36 sections, 3 theorems, 29 equations, 7 figures, 9 tables)

Introduction
Preliminaries
Sparse Autoencoders (SAEs)
Conditional Average Treatment Effects (CATEs)
Related Work
SAEs and Steering
Causal Inference with Text
Datasets
Hypothesis Generation
Sparse Linear Probing
Feature Selection for Steering Interventions
Steering
Concept Intensity Scores
Coherence Score
IC Score
...and 21 more sections

Key Result

Proposition 7.1

Assume that $X^\perp |T_\phi{=}1 \stackrel{d}{=} X^\perp |T_\phi{=}0$. Then $\tau = E[Y | T_\phi{=}1] - E[Y|T_\phi{=}0]$.
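Proposition 7.1 can be illustrated with a small simulation. The sketch below is not taken from the paper: the linear outcome model, the coefficient on `x_perp`, the `shift` parameter, and the reading of $\tau$ as the average effect of the treatment on $Y$ are all assumptions made for illustration. When `shift` is zero, the non-treatment part of the text has the same distribution in both arms, and the difference in mean outcomes should land close to $\tau$; when `shift` is nonzero, the assumption fails and the same estimator absorbs the covariate imbalance.

```python
import random
from statistics import fmean


def simulate(n, tau, shift, seed=0):
    """Return the difference in mean outcomes between treated and control units.

    Hypothetical data-generating process (an assumption, not the paper's):
      y = tau * t + 2 * x_perp + noise
    shift = 0 mimics the proposition's assumption that x_perp has the same
    distribution whether or not the treatment concept is present.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        t = rng.random() < 0.5
        # x_perp stands in for everything in the text other than the treatment.
        x_perp = rng.gauss(0.0, 1.0) + (shift if t else 0.0)
        y = tau * t + 2.0 * x_perp + rng.gauss(0.0, 1.0)
        (treated if t else control).append(y)
    return fmean(treated) - fmean(control)


TAU = 1.5
tight = simulate(20000, TAU, shift=0.0)  # assumption of the proposition holds
leaky = simulate(20000, TAU, shift=0.5)  # steering also moves x_perp

print(f"true effect:          {TAU:.3f}")
print(f"tight steering (0.0): {tight:.3f}")
print(f"leaky steering (0.5): {leaky:.3f}")
```

Under these assumed coefficients, the leaky case is off by roughly `2.0 * shift`, which is the kind of bias the abstract attributes to text conflating treatment and covariate information.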

Figures (7)

Figure 1: An overview of our methodology. We start with a dataset of text documents, with labels classifying the semantic concept we want to intervene on. We run probes on SAE representations and identify the most persistent and semantically relevant SAE features that serve as our hypotheses. We then steer texts to generate quasi-counterfactuals that are used in downstream experiments. Finally, we present a novel residualization approach for CATE estimation in such text-as-treatment settings, alongside theoretical guarantees and empirical evidence.
Figure 2: Our residualization pipeline. We assume that natural language is generated from latent semantic concepts, which serve as inputs to the LLM. By applying SAE steering interventions, we modify the model’s generation process and produce quasi-counterfactual texts. We then measure their ex-post intensity $I$, and use an embedding model to represent the text for use as controls. Finally, using our measured intensity, we residualize the embeddings to remove the target concept $\mathbf{C}_1$, which ensures that the treatment information is separated from the rest of the text.
Figure 3: CATE simulation from Dataset C, using Llama-3.1-8B-Instruct, layer 23 feature 53435, and all-MiniLM-L6-v2 embeddings. Top: distribution of true (black) and estimated CATE based on raw (red) and residualized (blue) covariates. Bottom: scatter of true vs. estimated CATE values based on raw (red) and residualized (blue) covariates, with the 45-degree line in black. Residualization is performed via the dimension-by-dimension strategy.
Figure 4: Examples of feature intensity as measured by the mean cosine similarity for Gemma-2-9b-IT. Left: a high-intensity feature that follows a linear trend with a relatively high slope, making it a good candidate for causal interventions. Center: a medium-intensity feature with a reversed 'L' shape, where there is no response to steering at first, followed by a sudden spike, which limits its suitability for downstream experiments. Right: a low-intensity feature with a constant slope despite increasing steering factors, making it completely ineffective for experimentation. Plots include only valid texts and steering factors with more than 75% validation rate.
Figure 5: Distributions of the out-of-sample predictive accuracy of treatment between raw and residualized embeddings, where residualization is conducted dimension-by-dimension (solid) and by dropping the first principal component (dashed), taken over all tested LLMs, datasets, SAE features, and embedding models.
...and 2 more figures
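
The dimension-by-dimension residualization named in the captions of Figures 3 and 5 can be sketched as follows. This is one plausible reading, not the authors' confirmed implementation: it assumes that each embedding dimension is regressed on the measured intensity $I$ by ordinary least squares with an intercept, and that the residuals replace the raw embeddings as controls. The function names `residualize` and `covariance` and the toy data are hypothetical.

```python
import random
from statistics import fmean


def residualize(embeddings, intensity):
    """Regress every embedding dimension on intensity and return the residuals.

    Assumed procedure: one univariate OLS fit (with intercept) per dimension.
    The returned residuals are mean-centered and have zero sample covariance
    with the intensity.
    """
    i_bar = fmean(intensity)
    i_centered = [v - i_bar for v in intensity]
    denom = sum(v * v for v in i_centered)
    if denom == 0.0:
        raise ValueError("intensity has no variation; nothing to residualize on")

    n, dims = len(embeddings), len(embeddings[0])
    residuals = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        column = [row[d] for row in embeddings]
        c_bar = fmean(column)
        # OLS slope of this dimension on intensity.
        slope = sum(ic * (c - c_bar) for ic, c in zip(i_centered, column)) / denom
        for r in range(n):
            residuals[r][d] = (column[r] - c_bar) - slope * i_centered[r]
    return residuals


def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    a_bar, b_bar = fmean(a), fmean(b)
    return fmean([(x - a_bar) * (y - b_bar) for x, y in zip(a, b)])


# Toy data: dimensions 0 and 2 leak the treatment intensity, dimension 1 does not.
rng = random.Random(0)
intensity = [rng.random() for _ in range(500)]
embeddings = [
    [0.8 * i + rng.gauss(0, 0.1), rng.gauss(0, 1.0), -0.3 * i + rng.gauss(0, 0.1)]
    for i in intensity
]

residuals = residualize(embeddings, intensity)
before = [covariance([row[d] for row in embeddings], intensity) for d in range(3)]
after = [covariance([row[d] for row in residuals], intensity) for d in range(3)]
print("covariance with intensity before:", before)
print("covariance with intensity after: ", after)
```

A linear fit per dimension only removes the linear association with $I$; any nonlinear dependence would remain in the residuals, so the sketch should be read as a lower bound on what a residualization step needs to do.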

Theorems & Definitions (6)

Proposition 7.1: Causal identification from tight steering
Proposition 7.2: Causal identification from perfect controlling
Theorem 7.6: Bias bound for imperfect controls
Proof of Proposition 7.1
Proof of Proposition 7.2
Proof of Theorem 7.6
