Recovering Latent Confounders from High-dimensional Proxy Variables

Nathan Mankovich, Homer Durand, Emiliano Diaz, Gherardo Varando, Gustau Camps-Valls

Abstract

Detecting latent confounders from proxy variables is an essential problem in causal effect estimation. Previous approaches are limited to low-dimensional proxies, sorted proxies, and binary treatments. We remove these assumptions and present a novel Proxy Confounder Factorization (PCF) framework for continuous treatment effect estimation when latent confounders manifest through high-dimensional, mixed proxy variables. On synthetic datasets in the high sample size regime, our two-step PCF implementation, using Independent Component Analysis (ICA-PCF), and our end-to-end implementation, using Gradient Descent (GD-PCF), achieve high correlation with the latent confounder and low absolute error in causal effect estimation. On climate data, ICA-PCF recovers four components that explain $75.9\%$ of the variance in the North Atlantic Oscillation, a known confounder of precipitation patterns in Europe. Code for our PCF implementations and experiments is available at https://github.com/IPL-UV/confound_it. The proposed methodology constitutes a stepping stone towards discovering latent confounders and can be applied to many problems in disciplines dealing with high-dimensional observed proxies, e.g., spatiotemporal fields.
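The two-step idea described in the abstract (first recover latent components from the proxies, then adjust for the confounding component when estimating the effect) can be illustrated with a small synthetic sketch. This is not the authors' ICA-PCF implementation: the data-generating process, the scalar treatment, the helper names (`simulate`, `ols_effect`, `two_step_sketch`), and the rule for picking the confounding component (largest product of absolute correlations with treatment and outcome) are all assumptions made for illustration. It relies on scikit-learn's `FastICA`.

```python
import numpy as np
from sklearn.decomposition import FastICA


def simulate(n=5000, k=3, p=50, beta=1.5, seed=0):
    """Hypothetical linear SCM: k latent sources, p proxies, one confounder."""
    rng = np.random.default_rng(seed)
    # Non-Gaussian (uniform), unit-variance latent sources.
    Z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, k))
    A = rng.normal(size=(k, p))                    # mixing into proxies
    X = Z @ A + 0.1 * rng.normal(size=(n, p))      # high-dimensional proxies
    z_c = Z[:, 0]                                  # the latent confounder
    t = z_c + rng.normal(size=n)                   # continuous treatment
    y = beta * t + 2.0 * z_c + rng.normal(size=n)  # outcome
    return X, t, y, z_c


def ols_effect(t, y, controls=None):
    """Coefficient of t in an OLS regression of y on t (plus optional controls)."""
    cols = [np.ones_like(t), t]
    if controls is not None:
        cols.append(controls)
    D = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    return coef[1]


def two_step_sketch(X, t, y, k=3, seed=0):
    """Step 1: ICA on the proxies. Step 2: adjust for the selected component."""
    S = FastICA(n_components=k, random_state=seed, max_iter=1000).fit_transform(X)
    # Assumed selection heuristic: a confounder relates to both t and y.
    score = [
        abs(np.corrcoef(S[:, j], t)[0, 1]) * abs(np.corrcoef(S[:, j], y)[0, 1])
        for j in range(k)
    ]
    z_hat = S[:, int(np.argmax(score))]
    return z_hat, ols_effect(t, y, z_hat)


beta = 1.5
X, t, y, z_c = simulate(beta=beta)
z_hat, beta_adj = two_step_sketch(X, t, y)
beta_naive = ols_effect(t, y)

corr = abs(np.corrcoef(z_hat, z_c)[0, 1])  # ICA sign is arbitrary, so use |corr|
err_naive = abs(beta_naive - beta)         # absolute error without adjustment
err_adj = abs(beta_adj - beta)             # absolute error after adjustment
print(f"|corr(z_hat, z_c)| = {corr:.3f}")
print(f"absolute error: naive = {err_naive:.3f}, adjusted = {err_adj:.3f}")
```

In this toy setting the unadjusted regression absorbs the confounder's influence into the treatment coefficient, while adjusting for the recovered component should bring the estimate closer to the true `beta`. The two printed quantities mirror the evaluation metrics named in the abstract: correlation with the latent confounder and absolute error of the causal effect estimate.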

Paper Structure (24 sections, 8 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: DAG studied in this work and induced by the Structural Causal Model (SCM) in Eq. \ref{eq:SCM}.
  • Figure 2: PCF implementations on synthetic data with exponential and Gaussian distributions for sampling $Z$. Performance is evaluated across different sample sizes in terms of correlation with the true confounder and error in estimating the causal effect. Metrics for the true confounder ($Z_c$) are labeled "oracle." The $x$-axis is in log scale, as are the $y$-axes for AE and AER.
  • Figure 3: PCF implementations on synthetic data with gamma and uniform distributions for sampling $Z$. Performance is evaluated across different sample sizes in terms of correlation with the true confounder and error in estimating the causal effect. Metrics for the true confounder ($Z_c$) are labeled "oracle." The $x$-axis is in log scale, as are the $y$-axes for AE and AER.
  • Figure 4: Performance of PCF implementations on synthetic data relative to baselines (lasso, ridge, and elastic net) in estimating the causal coefficient. Values are plotted across different sample sizes. Metrics for the true confounder ($Z_c$) are labeled "oracle." The $x$- and $y$-axes are in log scale.
  • Figure 5: ICA-PCF detects $4$ latent confounders ($\hat{\mathbf{z}}^{(1)}_c$, $\hat{\mathbf{z}}^{(2)}_c$, $\hat{\mathbf{z}}^{(3)}_c$, $\hat{\mathbf{z}}^{(4)}_c$) from geopotential height. Their respective correlations with NAO are $0.69$, $-0.28$, $0.05$, and $-0.45$. Edge labels are the estimated causal effects (regression coefficients).
  • ...and 3 more figures