Smoke and Mirrors in Causal Downstream Tasks

Riccardo Cadei, Lukas Lindorfer, Sylvia Cremer, Cordelia Schmid, Francesco Locatello

TL;DR

This paper examines how learning pipelines affect causal downstream questions when estimating the Average Treatment Effect from high-dimensional observations collected in randomized trials. The authors develop a theoretical framework for treatment-effect bias ($TEB$) and demonstrate that data sampling, encoder choice, and post-processing (e.g., discretization) can substantially distort causal estimates even in ideal RCT conditions. They validate these insights on ISTAnt, a real-world ant-grooming dataset, and on a synthetic CausalMNIST benchmark, showing that seemingly benign design choices can yield biased $ATE$ estimates and that prediction accuracy is a poor proxy for causal validity. The work provides practical guidelines for representation learning in scientific contexts and introduces benchmarks to advance causal downstream learning with high-dimensional data.

Abstract

Machine Learning and AI have the potential to transform data-driven scientific discovery, enabling accurate predictions for several scientific phenomena. As many scientific questions are inherently causal, this paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations in a Randomized Controlled Trial (RCT). Despite being the simplest possible causal setting and a perfect fit for deep learning, we theoretically find that many common choices in the literature may lead to biased estimates. To test the practical impact of these considerations, we recorded ISTAnt, the first real-world benchmark for causal inference downstream tasks on high-dimensional observations as an RCT studying how garden ants (Lasius neglectus) respond to microparticles applied onto their colony members by hygienic grooming. Comparing 6 480 models fine-tuned from state-of-the-art visual backbones, we find that the sampling and modeling choices significantly affect the accuracy of the causal estimate, and that classification accuracy is not a proxy thereof. We further validated the analysis, repeating it on a synthetically generated visual data set controlling the causal model. Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones. Further, we highlight guidelines for representation learning methods to help answer causal questions in the sciences.

Paper Structure (43 sections, 4 theorems, 34 equations, 11 figures, 7 tables)

Key Result

Lemma 3.1

Assuming the setting described in Section sec:setting, a predictive model $f$ for the factual outcomes with accuracy $1-\epsilon$ can lead to $|\text{TEB}(f)| = \frac{\epsilon}{\min_t P(T=t)} \geq 2\epsilon$, which invalidates any causal conclusion when the ATE is comparable with $\epsilon$ and/or
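
The bound in the lemma can be checked on a small hand-built example. The sketch below is illustrative only and is not taken from the paper: it assumes a binary outcome, a binary treatment, and that the treatment effect bias is the difference between the difference-in-means ATE computed on predictions and the one computed on true outcomes. All names (`ate_difference_in_means`, `treatment_effect_bias`) and all numbers are hypothetical. The model makes all of its errors in the smaller treatment group and in the same direction, which is the worst case the lemma describes.

```python
def ate_difference_in_means(outcomes, treatments):
    """Difference-in-means ATE estimate for a binary treatment (valid in an RCT)."""
    treated = [y for y, t in zip(outcomes, treatments) if t == 1]
    control = [y for y, t in zip(outcomes, treatments) if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def treatment_effect_bias(predictions, outcomes, treatments):
    """Assumed definition: ATE on predicted outcomes minus ATE on true outcomes."""
    return (ate_difference_in_means(predictions, treatments)
            - ate_difference_in_means(outcomes, treatments))


# Hypothetical trial: 200 treated units and 800 controls, so min_t P(T=t) = 0.2.
treatments = [1] * 200 + [0] * 800
# True outcomes: treated mean 0.5, control mean 0.4.
outcomes = [1] * 100 + [0] * 100 + [1] * 320 + [0] * 480

# Adversarial model: 50 errors in total (epsilon = 0.05), all of them
# flipping a true 0 to a predicted 1 inside the smaller (treated) group.
predictions = list(outcomes)
for i in range(100, 150):  # indices of treated units whose true outcome is 0
    predictions[i] = 1

n = len(outcomes)
epsilon = sum(p != y for p, y in zip(predictions, outcomes)) / n
p_treated = sum(treatments) / n
min_propensity = min(p_treated, 1 - p_treated)

teb = treatment_effect_bias(predictions, outcomes, treatments)
bound = epsilon / min_propensity

print(f"accuracy = {1 - epsilon:.2f}")
print(f"|TEB| = {abs(teb):.4f}, epsilon / min_t P(T=t) = {bound:.4f}")
```

In this construction a model with 95% accuracy shifts the estimated effect by about 0.25, more than twice the assumed true ATE of 0.1. Spreading the same 50 errors evenly across both groups and both directions would leave the estimate nearly unbiased, which is why accuracy alone says little about the bias.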

Figures (11)

  • Figure 1: Causal Model for generic partially annotated scientific experiment: $T$ treatment, $\bm{W}$ experimental settings, $\bm{X}$ high-dimensional observation, $Y$ outcome, $S$ annotation flag.
  • Figure 2: Examples of high-dimensional observations $\bm{X}$ with corresponding annotated social behaviour $Y$ from ISTAnt (ours).
  • Figure 3: Monte-Carlo simulation of the discretization bias' convergence result.
  • Figure 4: Violin plots comparing the Treatment Effect Relative Bias (TERB) per annotation criterion in the few- and many-annotation regimes. Biased annotations lead to biased ATE estimation (i.e., TERB$\neq$0), and random annotation should be preferred.
  • Figure 5: Scatter plot comparing the TERB and balanced accuracy in prediction among the 20 best models for each of 6 established encoders. Despite different downstream prediction performances, all the encoders (with the exception of MAE) lead to similar TERB (up to $\pm$ 50%).
  • ...and 6 more figures

Theorems & Definitions (8)

  • Definition 3.1: Treatment Effect Bias
  • Lemma 3.1: Informal
  • Theorem 3.1
  • Example 1
  • Lemma
  • Proof
  • Theorem
  • Proof