DoubleMLDeep: Estimation of Causal Effects with Multimodal Data

Sven Klaassen, Jan Teichert-Kluge, Philipp Bach, Victor Chernozhukov, Martin Spindler, Suhas Vijaykumar

TL;DR

The paper addresses causal effect estimation when confounding includes unstructured multimodal data (text and images) by extending the partially linear regression and double machine learning framework to multimodal nuisances. It introduces a middle-fusion neural architecture to estimate the nuisance functions $l_0(X)=\mathbb{E}[Y|X]$ and $m_0(X)=\mathbb{E}[D|X]$ while leveraging Neyman orthogonality via an orthogonal score $\psi(W,\theta,\hat{\eta})$, ensuring root-$N$ consistency for $\hat{\theta}$. A semi-synthetic data generator based on tabular, text, and image datasets is developed to validate inference under controlled confounding, demonstrating substantial bias reduction compared to a tabular-only baseline. Empirical results show nuisance $r^2$ around 0.88–0.90 and treatment-effect estimates $\hat{\theta}$ near the true $\theta_0=0.5$ (vs. a biased baseline), suggesting that multimodal information can improve causal estimation in economics, marketing, medicine, and beyond. The work also outlines extensions to other nonparametric causal models and additional unstructured data types for future research.
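For reference, the partially linear regression model and orthogonal score mentioned above can be written out as follows. This is the standard partialling-out formulation from the double machine learning literature, using the symbols of the summary; the noise terms $\varepsilon$ and $V$ and the function $g_0$ are notation assumed here, and the paper's own equations may be arranged differently.

$$
\begin{aligned}
Y &= \theta_0 D + g_0(X) + \varepsilon, & \mathbb{E}[\varepsilon \mid X, D] &= 0,\\
D &= m_0(X) + V, & \mathbb{E}[V \mid X] &= 0,\\
\psi(W,\theta,\eta) &= \bigl(Y - l(X) - \theta\,(D - m(X))\bigr)\,\bigl(D - m(X)\bigr), & \eta &= (l, m).
\end{aligned}
$$

Setting the empirical mean of $\psi(W_i,\theta,\hat{\eta})$ to zero and solving for $\theta$ gives the residual-on-residual estimator

$$
\hat{\theta} = \frac{\sum_{i=1}^{N} \bigl(D_i - \hat{m}(X_i)\bigr)\bigl(Y_i - \hat{l}(X_i)\bigr)}{\sum_{i=1}^{N} \bigl(D_i - \hat{m}(X_i)\bigr)^2}.
$$

Neyman orthogonality means that small errors in $\hat{l}$ and $\hat{m}$ affect $\hat{\theta}$ only to second order, which is why flexible learners such as neural networks can be used for the nuisance functions.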

Abstract

This paper explores the use of unstructured, multimodal data, namely text and images, in causal inference and treatment effect estimation. We propose a neural network architecture that is adapted to the double machine learning (DML) framework, specifically the partially linear model. An additional contribution of our paper is a new method to generate a semi-synthetic dataset which can be used to evaluate the performance of causal effect estimation in the presence of text and images as confounders. The proposed methods and architectures are evaluated on the semi-synthetic dataset and compared to standard approaches, highlighting the potential benefit of using text and images directly in causal studies. Our findings have implications for researchers and practitioners in economics, marketing, finance, medicine and data science in general who are interested in estimating causal quantities using non-traditional data.
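As a rough illustration of the comparison described above, the following sketch runs a cross-fitted partialling-out estimator on toy data. It is not the authors' implementation: the least-squares learner stands in for the neural networks, the three scalar features `x_tab`, `x_txt`, and `x_img` are hypothetical stand-ins for tabular features and text and image embeddings, and the data-generating coefficients are assumptions chosen only so that $\theta_0 = 0.5$.

```python
import numpy as np


def fit_predict_ols(X_train, y_train, X_test):
    """Least-squares stand-in for a nuisance learner (the paper uses neural networks)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def dml_plr(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in the partially linear model."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        # Nuisances are fitted on the other folds and evaluated on the held-out fold.
        l_hat = fit_predict_ols(X[train_mask], y[train_mask], X[test_idx])
        m_hat = fit_predict_ols(X[train_mask], d[train_mask], X[test_idx])
        y_res[test_idx] = y[test_idx] - l_hat
        d_res[test_idx] = d[test_idx] - m_hat
    # Solve the empirical moment condition of the orthogonal score for theta.
    theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)
    psi = (y_res - theta * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2) / np.mean(d_res ** 2) ** 2 / n)
    return theta, se


# Toy data (assumed coefficients): all three features confound D and Y.
rng = np.random.default_rng(42)
n = 5000
x_tab, x_txt, x_img = rng.normal(size=(3, n))
d = 0.5 * x_tab + x_txt + x_img + rng.normal(size=n)
y = 0.5 * d + 0.5 * x_tab + x_txt + x_img + rng.normal(size=n)

X_full = np.column_stack([x_tab, x_txt, x_img])
X_tab = x_tab[:, None]

theta_full, se_full = dml_plr(y, d, X_full)  # adjusts for all confounders
theta_tab, se_tab = dml_plr(y, d, X_tab)     # tabular-only baseline
print(f"all features: {theta_full:.3f} (se {se_full:.3f})")
print(f"tabular only: {theta_tab:.3f} (se {se_tab:.3f})")
```

In this toy setup the tabular-only estimate is biased upward because the omitted features still drive both treatment and outcome, while the estimate that adjusts for all three features lands close to 0.5. In the paper's setting the text and image confounders are raw unstructured inputs, so the nuisance learners must extract the relevant signal themselves rather than receive it as ready-made columns.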

Paper Structure (13 sections, 22 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Examples of directed acyclic graphs (DAGs) with image and text confounding. (a) Direct confounding via image, text and tabular data. (b) Treatment decision is driven by text and images. All backdoor paths are blocked by conditioning on both image and text data.
  • Figure 2: High-Level PLR Model Architecture. Both nuisance components are trained simultaneously with a combined loss.
  • Figure 3: DAG for the semi-synthetic dataset. The confounding via the features $X=(X_{\text{tab}}, X_{\text{txt}}, X_{\text{img}})$ can be adjusted for, whereas the unexplained/noise parts $U=(U_{\text{tab}}, U_{\text{txt}}, U_{\text{img}})$ are unobserved.
  • Figure 4: Boxplots of $r^2$-Scores. As anticipated, the tabular data provides only $30\%$ explanatory power, but the inclusion of unstructured data increases the predictable variance to approximately $90\%$.
  • Figure 5: Boxplots of $\hat{\theta}$. The Embedding Model and Deep Model have similar estimates. This indicates a stable and information-rich embedding $H_E$, which provides a high explanatory contribution independent of the subsequent ML method for predicting $Y$ and $D$ (Bengio et al., 2014). $\theta_0$ represents the upper bound.
  • ...and 1 more figures