Causal inference using invariant prediction: identification and confidence intervals

Jonas Peters, Peter Bühlmann, Nicolai Meinshausen

TL;DR

The paper tackles causal discovery from data collected in multiple environments by leveraging an invariance principle: if $S^{*}$ contains the direct causal predictors of a target $Y$, then the conditional distribution $Y^{e}|X^{e}_{S^{*}}$ is invariant across environments. It develops a framework that identifies plausible causal predictors by testing invariance across environments and constructs conservative confidence sets for the causal coefficients, without requiring a full graphical model or do-interventions. For linear SEMs with interventions, it provides identifiability results for the causal parents and demonstrates robustness to certain model misspecifications, with extensions to nonlinear settings and to hidden variables via instrumental variables. The work also offers practical tools, including an R package, and demonstrates applications to large-scale gene perturbation data and educational studies, highlighting the approach’s potential for reliable causal inference when randomized experiments are limited or infeasible.
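
The invariance principle and the resulting set estimate can be written compactly for the linear case. The display below is a sketch in notation assumed here rather than taken from this summary: $\mathcal{E}$ is the set of environments, $\gamma$ a coefficient vector supported on $S$, $F_\varepsilon$ a noise distribution shared by all environments, and $\alpha$ the test level.

$$
H_{0,S}(\mathcal{E}):\quad \exists\, \gamma \text{ with } \gamma_k = 0 \text{ for } k \notin S \text{ such that, for all } e \in \mathcal{E},\quad Y^{e} = X^{e}\gamma + \varepsilon^{e},\quad \varepsilon^{e} \sim F_\varepsilon,\quad \varepsilon^{e} \perp\!\!\!\perp X^{e}_{S}
$$

$$
\hat{S}(\mathcal{E}) \;=\; \bigcap_{S \,:\, H_{0,S}(\mathcal{E}) \text{ not rejected at level } \alpha} S,
\qquad
P\big(\hat{S}(\mathcal{E}) \subseteq S^{*}\big) \;\ge\; 1 - \alpha
$$

The coverage statement follows from a one-line argument: if $H_{0,S^{*}}(\mathcal{E})$ is true and the test has level $\alpha$, then $S^{*}$ is among the accepted sets with probability at least $1-\alpha$, and in that case the intersection of all accepted sets is contained in $S^{*}$.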

Abstract

What is the difference between a prediction made with a causal model and one made with a non-causal model? Suppose we intervene on the predictor variables or change the whole environment. The predictions from a causal model will in general work as well under interventions as for observational data. In contrast, predictions from a non-causal model can potentially be very wrong if we actively intervene on variables. Here, we propose to exploit this invariance of a prediction under a causal model for causal inference: given different experimental settings (for example various interventions) we collect all models that do show invariance in their predictive accuracy across settings and interventions. The causal model will be a member of this set of models with high probability. This approach yields valid confidence intervals for the causal relationships in quite general scenarios. We examine the example of structural equation models in more detail and provide sufficient assumptions under which the set of causal predictors becomes identifiable. We further investigate robustness properties of our approach under model misspecification and discuss possible extensions. The empirical properties are studied for various data sets, including large-scale gene perturbation experiments.
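
The procedure described in the abstract (collect every model whose predictive behaviour is invariant across settings, then keep what all of them share) can be illustrated with a minimal Python sketch. It is not the authors' R package: the function names `invariance_pvalue`, `invariant_prediction` and `simulate` are hypothetical, the invariance check is a simple stand-in (pooled least-squares residuals compared across environments with a Welch t-test and a Levene test, Bonferroni-corrected), and the simulated chain $X_1 \to Y \to X_2$ with shift interventions on $X_1$ and $X_2$ is an assumed toy example.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def invariance_pvalue(X, y, env, subset):
    """Approximate p-value for H0: the residuals of the pooled regression
    y ~ X[:, subset] have the same mean and spread in every environment."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    envs = np.unique(env)
    pvals = []
    for e in envs:
        inside, outside = resid[env == e], resid[env != e]
        p_mean = stats.ttest_ind(inside, outside, equal_var=False).pvalue
        p_var = stats.levene(inside, outside).pvalue
        pvals.append(2 * min(p_mean, p_var))  # Bonferroni over the two tests
    return min(1.0, len(envs) * min(pvals))   # Bonferroni over environments


def invariant_prediction(X, y, env, alpha=0.05):
    """Return the intersection of all accepted subsets, plus the accepted subsets."""
    p = X.shape[1]
    accepted = []
    for k in range(p + 1):
        for subset in combinations(range(p), k):
            if invariance_pvalue(X, y, env, subset) > alpha:
                accepted.append(set(subset))
    if not accepted:
        # Every subset was rejected; this sketch then reports the empty set.
        return set(), accepted
    return set.intersection(*accepted), accepted


def simulate(n_per_env=1000, seed=0):
    """Toy chain X1 -> Y -> X2; environment 1 shifts X1 and X2, never Y itself."""
    rng = np.random.default_rng(seed)
    env = np.repeat([0, 1], n_per_env)
    n = env.size
    x1 = rng.normal(size=n) + 2.0 * env
    y = 1.5 * x1 + rng.normal(size=n)
    x2 = 0.8 * y + rng.normal(size=n) + 2.0 * env
    return np.column_stack([x1, x2]), y, env


X, y, env = simulate()
estimate, accepted = invariant_prediction(X, y, env)
print("accepted subsets:", accepted)
print("estimated causal predictors (column indices):", estimate)
```

In this toy setup the only direct cause of `y` is column 0, so the estimate should be a subset of `{0}` with high probability. The exhaustive loop over subsets grows exponentially in the number of predictors, so a sketch like this is only practical for a handful of variables.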

Paper Structure

This paper contains 6 sections, 1 theorem, 2 equations, 2 figures.

Key Result

Proposition 1

Consider a linear structural equation model, as formally defined in Section \ref{sec:lgsem}, for the variables $(X_1=Y, X_2,\ldots, X_p, X_{p+1})$, with coefficients $(\beta_{jk})_{j,k=1, \ldots, p+1}$, whose structure is given by a directed acyclic graph. The independence assumption on the noise variable

Figures (2)

  • Figure 1: An example including three environments. The invariances \ref{eq:fullmodel} and \ref{invariance-nonlin} hold if we consider $S^* = \{X_2, X_4\}$. Considering indirect causes instead of direct ones (e.g. $\{X_2, X_5\}$) or an incomplete set of direct causes (e.g. $\{X_4\}$) may not be sufficient to guarantee invariant prediction.
  • Figure 2: Some examples from the gene-knockout experiments in Kemmeren et al. (2014), which will be discussed in more detail in Section \ref{sec:geneknockout}. Each panel shows the distribution of a target gene activity $Y$ (on the respective y-axis), conditional on a predictor gene activity $X$ (on the respective x-axis). Blue crosses show observational data and red dots show interventional data. The interventions do not occur on any of the shown genes. The conditional distribution of $Y$, given $X$, is not invariant for the examples in the first row, while invariance cannot be rejected for the two examples in the bottom row. Take the example of the bottom left panel. The variance of the activity of gene $\mathit{YMR321C}$ is clearly higher for interventional than for observational data, so we can reject that the invariance assumption holds for the empty set $S=\emptyset$. However, when conditioning on the activity $X$ of gene $\mathit{YPL273W}$, the conditional distribution of the activity $Y$ of gene $\mathit{YMR321C}$ is not significantly different between interventional and observational data, so that the set $S=\{\mathit{YPL273W}\}$ fulfils the invariance assumption \ref{eq:lincausal}, at least approximately.

Theorems & Definitions (1)

  • Proposition 1