Estimating Visual Attribute Effects in Advertising from Observational Data: A Deepfake-Informed Double Machine Learning Approach

Yizhi Liu, Balaji Padmanabhan, Siva Viswanathan

TL;DR

The paper develops DICE-DML (Deepfake-Informed Control Encoder for Double Machine Learning), a framework that uses generative AI to disentangle treatment from confounders, providing a principled approach to causal inference with visual data when treatments and confounders coexist within images.

Abstract

Digital advertising increasingly relies on visual content, yet marketers lack rigorous methods for understanding how specific visual attributes causally affect consumer engagement. This paper addresses a fundamental methodological challenge: estimating causal effects when the treatment, such as a model's skin tone, is an attribute embedded within the image itself. Standard approaches like Double Machine Learning (DML) fail in this setting because vision encoders entangle treatment information with confounding variables, producing severely biased estimates. We develop DICE-DML (Deepfake-Informed Control Encoder for Double Machine Learning), a framework that leverages generative AI to disentangle treatment from confounders. The approach combines three mechanisms: (1) deepfake-generated image pairs that isolate treatment variation; (2) DICE-Diff adversarial learning on paired difference vectors, where background signals cancel to reveal pure treatment fingerprints; and (3) orthogonal projection that geometrically removes treatment-axis components. In simulations with known ground truth, DICE-DML reduces root mean squared error by 73-97% compared to standard DML, with the strongest improvement (97.5%) at the null effect point, demonstrating robust Type I error control. Applying DICE-DML to 232,089 Instagram influencer posts, we estimate the causal effect of skin tone on engagement. Standard DML produces diagnostically invalid results (negative outcome R^2), while DICE-DML achieves valid confounding control (R^2 = 0.63) and estimates a marginally significant negative effect of darker skin tone (-522 likes; p = 0.062), substantially smaller than the biased standard estimate. Our framework provides a principled approach for causal inference with visual data when treatments and confounders coexist within images.
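
The final estimation step described above, DML applied to confounder representations, can be illustrated with a minimal sketch of the partially linear ("residual-on-residual") estimator with cross-fitting. This is not the authors' implementation. It assumes a continuous treatment, uses ridge regression as a stand-in for the outcome and propensity models, and runs on synthetic data. The names `cross_fit_residuals`, `dml_plr`, `Z_clean`, and `tau_true` are hypothetical and chosen for illustration only.

```python
import numpy as np

def cross_fit_residuals(X, target, n_folds=5, ridge=1e-3, seed=0):
    """Out-of-fold residuals of `target` regressed on X.

    Ridge regression is an assumed stand-in for whatever nuisance
    learner (outcome or propensity model) is actually used.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    Xb = np.column_stack([np.ones(n), X])  # add an intercept column
    resid = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        A = Xb[train]
        beta = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]),
                               A.T @ target[train])
        # Predict only on the held-out fold (cross-fitting)
        resid[test_idx] = target[test_idx] - Xb[test_idx] @ beta
    return resid

def dml_plr(Y, T, Z, n_folds=5):
    """Partially linear DML: regress outcome residuals on treatment residuals."""
    y_res = cross_fit_residuals(Z, Y, n_folds)
    t_res = cross_fit_residuals(Z, T, n_folds)
    tau = (t_res @ y_res) / (t_res @ t_res)
    # Heteroskedasticity-robust standard error for the residual regression
    eps = y_res - tau * t_res
    se = np.sqrt(np.sum(t_res**2 * eps**2)) / (t_res @ t_res)
    return tau, se

# Synthetic data with a known effect (all values are illustrative assumptions)
rng = np.random.default_rng(42)
n, d, tau_true = 2000, 5, 0.5
Z_clean = rng.normal(size=(n, d))                    # confounder representation
T = Z_clean @ rng.normal(size=d) + rng.normal(size=n)
Y = tau_true * T + Z_clean @ rng.normal(size=d) + rng.normal(size=n)

tau_hat, se = dml_plr(Y, T, Z_clean)
print(f"tau_hat = {tau_hat:.3f}, se = {se:.3f}, t = {tau_hat / se:.2f}")
```

The sketch only works as intended when the controls carry confounder information but not the treatment itself. If the representation also encoded the treatment, the treatment residuals would shrink toward zero and the ratio would become unstable, which is the leakage problem the abstract attributes to standard vision encoders.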

Paper Structure (23 sections, 10 equations, 3 figures, 5 tables, 1 algorithm)

Figures (3)

  • Figure 1: The DICE-DML Framework. The pipeline consists of five stages: (1) Input pairs of original images and their deepfake counterparts with altered treatment attribute; (2) A shared encoder (e.g., ResNet+MLP) extracts representations $Z = f(I)$ and $Z' = f(I')$; (3) The DICE-Diff adversarial learning operates on difference vectors $\Delta = Z - Z'$ where background signals cancel, training discriminators to detect treatment while the encoder learns to fool them via gradient reversal; (4) Orthogonal projection computes sample-wise treatment directions from paired differences and projects them out to obtain $Z_{\text{clean}}$; (5) The cleaned representations feed into standard DML with outcome and propensity models for treatment effect estimation.
  • Figure 2: Asymptotic Normality of Treatment Effect Estimators. Distribution of standardized t-statistics at $\tau = 0$ across 200 simulations. Naive estimation without controls (blue) produces highly dispersed estimates (std = 11.53). Standard DML with original ResNet features (orange) remains overdispersed (std = 2.83), indicating treatment leakage distorts variance. DICE-DML (green) closely tracks the theoretical $N(0,1)$ distribution (std $\approx$ 0.98), confirming valid inference and proper Type I error control.
  • Figure 3: Distribution of difference vector norms between original and deepfake image pairs. DICE-DML reduces the mean L2 distance from 3.6 to 0.9 (75% reduction), indicating that the encoder learns to produce similar representations for image pairs that differ only in treatment (skin tone), while preserving other visual information. This compression does not reflect representation collapse, as the same embeddings achieve $R^2 = 0.63$ for outcome ($Y$) prediction.
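
Stage (4) of Figure 1, the orthogonal projection, can be sketched under simplifying assumptions. The caption says only that sample-wise treatment directions are computed from paired differences and projected out, so the exact formula below is one plausible reading and not the paper's confirmed procedure. The function name `project_out_treatment` and the toy data are hypothetical.

```python
import numpy as np

def project_out_treatment(Z, Z_fake, eps=1e-12):
    """Remove each sample's treatment-axis component from its embedding.

    Z, Z_fake: (n, d) embeddings of original images and deepfake counterparts.
    The per-sample direction is the normalized difference Z - Z_fake.
    """
    delta = Z - Z_fake
    norms = np.linalg.norm(delta, axis=1, keepdims=True)
    direction = delta / np.maximum(norms, eps)           # unit vector per sample
    coeff = np.sum(Z * direction, axis=1, keepdims=True)  # component along it
    return Z - coeff * direction

# Toy pairs: a shared background plus opposite shifts along a treatment axis
rng = np.random.default_rng(0)
n, d = 4, 8
background = rng.normal(size=(n, d))
axis = rng.normal(size=(n, d))
axis /= np.linalg.norm(axis, axis=1, keepdims=True)
shift = rng.uniform(0.5, 1.5, size=(n, 1))
Z = background + shift * axis
Z_fake = background - shift * axis

Z_clean = project_out_treatment(Z, Z_fake)
Z_fake_clean = project_out_treatment(Z_fake, Z)

# After projection, both members of a pair map to the same cleaned vector
print(np.allclose(Z_clean, Z_fake_clean))
```

In this toy setup the two images of a pair differ only along the treatment axis, so the projection leaves the background component orthogonal to that axis and discards the rest. Real encoder outputs would satisfy this only approximately, which is presumably why the pipeline applies adversarial training on the difference vectors before projecting.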

Theorems & Definitions (1)

  • Definition 1: Visual Treatment Leakage