DoubleMLDeep: Estimation of Causal Effects with Multimodal Data

Sven Klaassen; Jan Teichert-Kluge; Philipp Bach; Victor Chernozhukov; Martin Spindler; Suhas Vijaykumar

TL;DR

The paper addresses causal effect estimation when confounding includes unstructured multimodal data (text and images) by extending the partially linear regression and double machine learning framework to multimodal nuisances. It introduces a middle-fusion neural architecture to estimate the nuisance functions $l_0(X)=\mathbb{E}[Y|X]$ and $m_0(X)=\mathbb{E}[D|X]$ while leveraging Neyman orthogonality via an orthogonal score $\psi(W,\theta,\hat{\eta})$, ensuring root-$N$ consistency for $\hat{\theta}$. A semi-synthetic data generator based on tabular, text, and image datasets is developed to validate inference under controlled confounding, demonstrating substantial bias reduction compared to a tabular-only baseline. Empirical results show nuisance $r^2$ around 0.88–0.90 and treatment-effect estimates $\hat{\theta}$ near the true $\theta_0=0.5$ (vs. a biased baseline), suggesting that multimodal information can improve causal estimation in economics, marketing, medicine, and beyond. The work also outlines extensions to other nonparametric causal models and additional unstructured data types for future research.
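To make the partialling-out idea concrete, here is a minimal sketch of a cross-fitted estimator for the partially linear model $Y = \theta_0 D + g_0(X) + \varepsilon$, using the orthogonal score $\psi(W,\theta,\eta) = (Y - l(X) - \theta\,(D - m(X)))\,(D - m(X))$. It is not the paper's implementation: the nuisance learner is a plain least-squares fit standing in for the middle-fusion network, the toy data are purely tabular rather than the semi-synthetic multimodal dataset, and the names `fit_linear` and `dml_plr` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept (assumed stand-in for any nuisance learner)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def dml_plr(X, d, y, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in the PLR model."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    res_y = np.empty(n)  # out-of-fold residuals Y - l_hat(X)
    res_d = np.empty(n)  # out-of-fold residuals D - m_hat(X)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        l_hat = fit_linear(X[train_mask], y[train_mask])  # approximates E[Y|X]
        m_hat = fit_linear(X[train_mask], d[train_mask])  # approximates E[D|X]
        res_y[test_idx] = y[test_idx] - l_hat(X[test_idx])
        res_d[test_idx] = d[test_idx] - m_hat(X[test_idx])
    # Solve the empirical moment condition sum(psi) = 0 for theta.
    theta = np.sum(res_d * res_y) / np.sum(res_d ** 2)
    # Plug-in standard error based on the score.
    psi = (res_y - theta * res_d) * res_d
    jacobian = np.mean(res_d ** 2)
    se = np.sqrt(np.mean(psi ** 2) / jacobian ** 2 / n)
    return theta, se

# Toy confounded data with true effect 0.5 (illustrative values only).
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 3))
d = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)
y = 0.5 * d + X @ np.array([0.8, 0.3, -1.0]) + rng.normal(size=n)

theta_hat, se_hat = dml_plr(X, d, y)
naive = np.sum(d * y) / np.sum(d * d)  # regression of y on d that ignores X
print(f"DML estimate: {theta_hat:.3f} (se {se_hat:.3f}), naive: {naive:.3f}")
```

In this toy setup the naive regression absorbs the confounding through $X$, while the residual-on-residual estimate should land close to the true value. Swapping `fit_linear` for a more flexible learner leaves the rest of the procedure unchanged, which is the property the multimodal extension relies on.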

Abstract

This paper explores the use of unstructured, multimodal data, namely text and images, in causal inference and treatment effect estimation. We propose a neural network architecture that is adapted to the double machine learning (DML) framework, specifically the partially linear model. An additional contribution of our paper is a new method to generate a semi-synthetic dataset which can be used to evaluate the performance of causal effect estimation in the presence of text and images as confounders. The proposed methods and architectures are evaluated on the semi-synthetic dataset and compared to standard approaches, highlighting the potential benefit of using text and images directly in causal studies. Our findings have implications for researchers and practitioners in economics, marketing, finance, medicine and data science in general who are interested in estimating causal quantities using non-traditional data.

Paper Structure (13 sections, 22 equations, 6 figures, 1 table)

Introduction
Literature Review and Examples
Getting started / Warm up: Double Machine Learning for Tabular Data
Double Machine Learning for Text and Images
Model
Deep Learning Architecture and Implementation Details
Simulation Study
Simulating Confounding with Text and Images
Results
Conclusion
Appendix
Definitions
Semi-Synthetic Dataset

Figures (6)

Figure 1: Examples of directed acyclic graphs (DAGs) with image and text confounding. (a) Direct confounding via image, text and tabular data. (b) Treatment decision is driven by text and images. All backdoor paths are blocked by conditioning on both image and text data.
Figure 2: High-Level PLR Model Architecture. Both nuisance components are trained simultaneously with a combined loss.
Figure 3: DAG for the semi-synthetic dataset. The confounding via the features $X=(X_{\text{tab}}, X_{\text{txt}}, X_{\text{img}})$ can be adjusted for, whereas the unexplained/noise parts $U=(U_{\text{tab}}, U_{\text{txt}}, U_{\text{img}})$ are unobserved.
Figure 4: Boxplots of $r^2$-Scores. As anticipated, the tabular data provides only $30\%$ explanatory power, but the inclusion of unstructured data increases the predictable variance to approximately $90\%$.
Figure 5: Boxplots of $\hat{\theta}$. The Embedding Model and Deep Model have similar estimates. This indicates a stable and information-rich embedding $H_E$, which provides a high explanatory contribution independent of the subsequent ML method for predicting $Y$ and $D$ [bengio2014representation]. $\theta_0$ represents the upper bound.
...and 1 more figure
