Efficient Identification of Direct Causal Parents via Invariance and Minimum Error Testing

Minh Nguyen, Mert R. Sabuncu

TL;DR

This work addresses scalable local causal discovery under distribution shifts by improving invariance-based methods. It introduces MMSE-ICP and fastICP, two algorithms that use a minimum-mean-squared-error (MMSE) inequality to identify direct causal parents and that come with identifiability guarantees under stated assumptions; fastICP additionally relies on a heuristic to run far fewer tests than classic ICP. In simulations and on a large-scale gene expression benchmark, the methods outperform competitive baselines and achieve state-of-the-art results. The approach targets causal variable identification in high-dimensional systems where only a few variables are perturbed, with potential implications for robust representation learning and domain generalization.

Abstract

Invariant causal prediction (ICP) is a popular technique for finding the causal parents (direct causes) of a target by exploiting distribution shifts and invariance testing (Peters et al., 2016). However, because ICP needs to run an exponential number of tests and fails to identify parents when distribution shifts affect only a few variables, applying ICP to practical large-scale problems is challenging. We propose MMSE-ICP and fastICP, two approaches that employ an error inequality to address the identifiability problem of ICP. The inequality states that the minimum prediction error of the predictor using the causal parents is the smallest among all predictors that do not use descendants. fastICP is an efficient approximation tailored to large problems: it exploits the inequality and a heuristic to run fewer tests. MMSE-ICP and fastICP not only outperform competitive baselines in many simulations but also achieve state-of-the-art results on a large-scale real-data benchmark.
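To make the "exponential number of tests" concrete, here is a minimal sketch that enumerates the candidate predictor sets an exhaustive subset search would have to test. The helper `candidate_sets` and the variable names are hypothetical and are not taken from the paper; the value $d=6$ matches the small simulations listed in the figure captions.

```python
from itertools import combinations


def candidate_sets(predictors):
    """Yield every subset of `predictors`, including the empty set."""
    for k in range(len(predictors) + 1):
        for subset in combinations(predictors, k):
            yield frozenset(subset)


# Hypothetical predictor names for a graph with d = 6 candidate variables.
predictors = [f"X{i}" for i in range(1, 7)]

# An exhaustive search runs one invariance test per subset: 2**d in total.
n_tests = sum(1 for _ in candidate_sets(predictors))
print(n_tests)  # 64
```

The count doubles with every added variable, which is why an exhaustive search quickly becomes impractical as $d$ grows.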

Paper Structure

This paper contains 29 sections, 5 theorems, 3 equations, 21 figures, 2 tables, 3 algorithms.

Key Result

Lemma 3.1

Let $\mathbf{X}_1$ and $\mathbf{X}_2$ denote two sets of variables, not necessarily mutually exclusive. Then: $\mathsf{MMSE}(\mathbf{X}_1 \cup \mathbf{X}_2) \leq \mathsf{MMSE}(\mathbf{X}_1)$. Equality holds if $Y \perp\!\!\!\perp \mathbf{X}_2 \mid \mathbf{X}_1$.
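The inequality can be checked numerically in a simple case. The sketch below assumes a hypothetical linear Gaussian model (chain $X_2 \to X_1 \to Y$ plus an independent parent $X_3 \to Y$) that is not taken from the paper. In such a model the MMSE predictor is linear, so the in-sample least-squares error serves as a stand-in for $\mathsf{MMSE}$. The helper `linear_mse` and all coefficients are illustrative choices.

```python
import numpy as np


def linear_mse(X, y):
    """In-sample mean squared error of the least-squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(np.mean(resid ** 2))


rng = np.random.default_rng(0)
n = 20_000

# Hypothetical model: X2 -> X1 -> Y and X3 -> Y, all noise terms standard normal.
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.5 * x1 - 1.0 * x3 + rng.normal(size=n)

mse_x1 = linear_mse(x1[:, None], y)
# Y is independent of X2 given X1, so adding X2 should leave the error essentially unchanged.
mse_x1_x2 = linear_mse(np.column_stack([x1, x2]), y)
# X3 carries information about Y beyond X1, so adding it should lower the error.
mse_x1_x3 = linear_mse(np.column_stack([x1, x3]), y)

print(mse_x1, mse_x1_x2, mse_x1_x3)
```

In this setup the error with $\{X_1, X_2\}$ never exceeds the error with $\{X_1\}$ and the two are nearly equal, whereas adding $X_3$ reduces the error noticeably.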

Figures (21)

  • Figure 1: A motivating example: a directed acyclic graph (DAG) representing a noisy structural causal model, in which no causal mechanism is noise-free and thus there are no duplicate variables. Let $E$ denote the variable capturing the environment (or context) (Mooij et al., 2020). Its children are directly affected by interventions, which reflects distribution shifts between contexts. $Y$ is the target variable. $\hat{\mathbf{S}}_{\text{ICP}}=\emptyset$ because the invariant sets are $\{X_1\}$, $\{X_2\}$, and $\{X_1, X_2\}$. In contrast, $\hat{\mathbf{S}}_{\text{IAS}}=\{X_1, X_2\}$. Our methods output $\{X_1\}$ since the prediction error of $\hat{Y}_M(X_1)$ is lower than that of $\hat{Y}_M(X_2)$.
  • Figure 2: Performance when $N_{\text{int}}=d=6$ (Table \ref{tbl:sim_properties}, No. 1). Linear simulation. Reference set: $\mathsf{PA}(Y)$.
  • Figure 3: Performance when $N_{\text{int}}{=}1; d{=}6$ (Table \ref{tbl:sim_properties}, No. 2). Linear simulation. See Appendix \ref{app:linear} for results when $N_{\text{int}}{=}2$ or $N_{\text{int}}{=}3$.
  • Figure 4: Performance for large graphs. Reference set: $\mathbf{S}^*$. Also see Figure \ref{fig:app_d100} and \ref{fig:app_d100d} in the Appendix.
  • Figure 5: Performance for nonlinear simulation. $N_{\text{int}}{=}1; d{=}6$. Reference set: $\mathbf{S}^*$. $nl$: using nonlinear regression in invariance test. Also see Figure \ref{fig:app_nl} in the Appendix.
  • ...and 16 more figures
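
The set outputs quoted in the Figure 1 caption can be reproduced with plain set operations. This is a minimal sketch that takes the three invariant sets from the caption as given rather than testing for invariance. It assumes that ICP returns the intersection of all invariant sets and that IAS returns the union of the minimally invariant sets; the variable names are hypothetical.

```python
# Invariant sets as listed in the Figure 1 caption (taken as given, not tested here).
invariant_sets = [
    frozenset({"X1"}),
    frozenset({"X2"}),
    frozenset({"X1", "X2"}),
]

# Assumption: ICP outputs the intersection of all invariant sets.
icp_output = frozenset.intersection(*invariant_sets)

# Assumption: IAS outputs the union of the minimally invariant sets,
# i.e. the invariant sets that have no invariant proper subset.
minimal_sets = [
    s for s in invariant_sets
    if not any(t < s for t in invariant_sets)
]
ias_output = frozenset().union(*minimal_sets)

print(sorted(icp_output))  # []
print(sorted(ias_output))  # ['X1', 'X2']
```

Because $\{X_1\}$ and $\{X_2\}$ are disjoint, the intersection is empty, which matches $\hat{\mathbf{S}}_{\text{ICP}}=\emptyset$ in the caption.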

Theorems & Definitions (9)

  • Lemma 3.1: Error Inequality
  • proof
  • Corollary 3.2
  • proof
  • Lemma 3.3
  • proof
  • Theorem 3.4: Identifiability
  • Proposition 3.5
  • proof