Learning Causality for Longitudinal Data

Mouad EL Bouchattaoui

TL;DR

This thesis tackles causal inference in high-dimensional, time-varying data through three contributions: CDVAE, a Causal Dynamic Variational Autoencoder that estimates Individual Treatment Effects (ITEs) when adjustment variables are unobserved, with theoretical guarantees; an RNN framework enhanced with Contrastive Predictive Coding (CPC) and InfoMax for efficient long-horizon counterfactual regression, which handles time-varying confounders without transformers; and a causal representation learning (CRL) approach with a Jacobian-based, model-agnostic interpretability layer that uncovers modular latent-to-observed structure with formal recovery guarantees. It draws on potential-outcome theory, sequential ignorability, and deconfounding to address missing confounders, causal feedback, and policy learning in dynamic regimes. Empirically, the methods approach oracle-level accuracy on synthetic and semi-synthetic datasets, and they supply practical tools for interpretable causal representation of complex longitudinal data, with applications to precision medicine and retail analytics. Overall, the work advances causal representation learning and scalable causal inference in dynamic, high-dimensional environments, in support of personalized, data-driven decision making under uncertainty.

Abstract

This thesis develops methods for causal inference and causal representation learning (CRL) in high-dimensional, time-varying data. The first contribution introduces the Causal Dynamic Variational Autoencoder (CDVAE), a model for estimating Individual Treatment Effects (ITEs) by capturing unobserved heterogeneity in treatment response driven by latent risk factors that affect only outcomes. CDVAE comes with theoretical guarantees on valid latent adjustment and generalization bounds for ITE error. Experiments on synthetic and real datasets show that CDVAE outperforms baselines, and that state-of-the-art models greatly improve when augmented with its latent substitutes, approaching oracle performance without access to true adjustment variables. The second contribution proposes an efficient framework for long-term counterfactual regression based on RNNs enhanced with Contrastive Predictive Coding (CPC) and InfoMax. It captures long-range dependencies under time-varying confounding while avoiding the computational cost of transformers, achieving state-of-the-art results and introducing CPC into causal inference. The third contribution advances CRL by addressing how latent causes manifest in observed variables. We introduce a model-agnostic interpretability layer based on the geometry of the decoder Jacobian. A sparse self-expression prior induces modular, possibly overlapping groups of observed features aligned with shared latent influences. We provide recovery guarantees in both disjoint and overlapping settings and show that meaningful latent-to-observed structure can be recovered without anchor features or single-parent assumptions. Scalable Jacobian-based regularization techniques are also developed.
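The second contribution relies on Contrastive Predictive Coding, which is usually trained with the InfoNCE objective. The sketch below is a minimal NumPy illustration of that loss, not the thesis's implementation: the function name `info_nce`, the dot-product scoring, and the use of the other rows in the batch as negatives are all assumptions made for the example.

```python
import numpy as np


def info_nce(context, future):
    """InfoNCE loss for a batch of (context, future) pairs.

    Hypothetical helper for illustration only.
    context, future: arrays of shape (batch, dim). Row i of `future` is the
    positive sample for row i of `context`; every other row is a negative.
    """
    # Dot-product scores: scores[i, j] = <context_i, future_j> (an assumption;
    # CPC models often use a learned bilinear score instead).
    scores = context @ future.T
    # Numerically stable log-softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching pair (the diagonal) as the target.
    return -np.mean(np.diag(log_probs))
```

For a batch of size N, `log(N) - info_nce(...)` is a lower bound on the mutual information between context and future representations, which is why minimizing this loss is read as an InfoMax-style objective. Uninformative (all-zero) representations give a loss of exactly `log(N)`.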

Paper Structure

This paper contains 80 sections, 174 equations, 4 figures, 26 tables, 1 algorithm.

Figures (4)

  • Figure 1: A compressed description of the data generation process behind the causal Question \ref{question:cdvae}.
  • Figure 2: A compressed description of the data generation process behind the causal Question \ref{question:ccpc}.
  • Figure 3: A general causal graph clarifying the causal links between outcome $Y$, treatment $W$, confounders $X$, adjustment variables $U$ and instruments $I$.
  • Figure 4: The causal graph assumed to generate longitudinal data over three time steps ($T = 3$). Edges are colored (pink, blue, or red) whenever the causal relation may contribute to confounding between treatment and response at some time step $t$.
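
For readers unfamiliar with the notation in Figure 3, the standard potential-outcome quantities can be written as follows. This is a sketch in the static, binary-treatment setting ($W \in \{0, 1\}$), which is an assumption made here for brevity; the thesis's own definitions for the longitudinal case may differ in form.

```latex
% Conditional average treatment effect given confounders X (static setting, assumed)
\tau(x) = \mathbb{E}\left[ Y(1) - Y(0) \mid X = x \right]

% Under consistency, ignorability (Y(0), Y(1)) \perp W \mid X, and overlap,
% the CATE is identified from observed data:
\tau(x) = \mathbb{E}\left[ Y \mid X = x, W = 1 \right] - \mathbb{E}\left[ Y \mid X = x, W = 0 \right]

% Augmented CATE, additionally conditioning on adjustment variables U
\tau(x, u) = \mathbb{E}\left[ Y(1) - Y(0) \mid X = x, U = u \right]

% The two are linked by the tower property
\tau(x) = \mathbb{E}\left[ \tau(x, U) \mid X = x \right]
```

The last line shows why conditioning on $U$ gives a finer-grained effect: $\tau(x)$ averages $\tau(x, u)$ over the adjustment variables, so heterogeneity driven by $U$ is invisible at the level of $\tau(x)$ alone.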

Theorems & Definitions (14)

  • proof : CATE Identifiability
  • proof : Augmented CATE Identifiability
  • proof : Theorem \ref{thm:valid_Z}
  • proof : ELBO
  • proof
  • proof : Theorem \ref{thm:vaes_to_truell}
  • proof : Theorem \ref{thm:pehe_wbound}
  • proof
  • proof
  • proof
  • ...and 4 more