Large-Scale Targeted Cause Discovery via Learning from Simulated Data

Jang-Hyun Kim, Claudia Skok Gibbs, Sangdoo Yun, Hyun Oh Song, Kyunghyun Cho

TL;DR

This work proposes a novel machine learning approach for inferring the causal variables of a target variable from observations. Its complexity is linear in the number of variables, so it scales efficiently to thousands of variables.

Abstract

We propose a novel machine learning approach for inferring the causal variables of a target variable from observations. Our focus is on directly inferring a set of causal factors without reconstructing the full causal graph, which is computationally challenging in large-scale systems. The identified causal set consists of all potential regulators of the target variable under experimental settings, enabling efficient regulation through intervention. To achieve this, we train a neural network with supervised learning on simulated data to infer causality. By employing a subsampled-ensemble inference strategy, our approach has complexity linear in the number of variables and scales efficiently to thousands of variables. Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks, outperforming existing methods that emphasize full-graph discovery. We validate our model's generalization capability across out-of-distribution graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line. The implementation code is available at https://github.com/snu-mllab/Targeted-Cause-Discovery.

Paper Structure (62 sections, 6 theorems, 5 equations, 16 figures, 10 tables, 2 algorithms)

Key Result

Proposition 1

The set of marginal causes of ${\textnormal{x}}_i$ is a subset of $\mathrm{An}({\textnormal{x}}_i)$.
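Proposition 1 bounds the candidate causes of a target by $\mathrm{An}({\textnormal{x}}_i)$, read here as the ancestor set in the usual notation. As a minimal sketch, assuming the causal graph is known and stored as a dictionary from each node to its direct parents, the ancestor set can be collected with a backward traversal. The toy graph and the names `ancestors` and `parents` are hypothetical and are not taken from the paper.

```python
from collections import deque

def ancestors(parents, target):
    """Return every node that has a directed path to `target`.

    `parents` maps a node to an iterable of its direct parents
    (hypothetical representation, assumed for this sketch).
    """
    seen = set()
    queue = deque(parents.get(target, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        # Walk one step further back along incoming edges.
        queue.extend(parents.get(node, ()))
    return seen

# Hypothetical toy DAG: a -> b -> d, c -> d, d -> e
parents = {"b": ["a"], "d": ["b", "c"], "e": ["d"]}
print(sorted(ancestors(parents, "d")))
```

Under the proposition, any variable outside this set can be excluded as a marginal cause of the target; the sketch does not decide which ancestors actually are marginal causes.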

Figures (16)

  • Figure 1: Illustration of targeted cause discovery in a causal graph. Instead of inferring the full causal graph structure, we identify a set of causal variables for a target.
  • Figure 2: Targeted cause discovery error rate as a function of shortest path length between nodes. Cause refers to our method, directly estimating the causes of a target. Structure denotes a method that infers causes from an estimated causal graph. Both methods use the same model architecture and dataset but differ in training objectives and inference strategies. The appendix provides the detailed experimental setting.
  • Figure 2: AUROC (%) on random graphs (validation) and E. coli GRN (test).
  • Figure 3: Overview of our method. The left figure depicts a single training step, while the right figure illustrates the inference procedure with multiple subsampling and ensembling. Note that the intervention matrix $M$ undergoes the same subsampling as $X$, resulting in the stacked input $[X_{V,O}, M_{V,O}]$ of shape $n' \times m'\times 2$, which is then fed into the model $f_\theta$. We omit the intervention matrix in the figure for simplicity.
  • Figure 4: Benchmarking results. (a) Performance on E. coli GRN with 1565 genes over varying levels of the simulator's observational fidelity. AUROC, AP, and F1 score values, including standard deviations, are reported in the benchmark table. (b) The cause prediction error rate as a function of the shortest path length between variables in a causal graph.
  • ...and 11 more figures
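
The inference procedure in the Figure 3 caption (repeated subsampling, a stacked input $[X_{V,O}, M_{V,O}]$ of shape $n' \times m' \times 2$, then ensembling) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: rows are taken to be variables and columns observations, the target is forced into every variable subset, scores are averaged over the subsets in which a variable appears, and `placeholder_model` is a simple correlation score standing in for the trained network $f_\theta$. All function and parameter names are hypothetical.

```python
import numpy as np

def placeholder_model(inp, target_row):
    """Stand-in for f_theta (assumption): absolute correlation with the target.

    `inp` has shape (n', m', 2); channel 0 is X, channel 1 is M (unused here).
    """
    x = inp[..., 0]
    x = x - x.mean(axis=1, keepdims=True)
    norm = np.linalg.norm(x, axis=1) + 1e-12
    return np.abs(x @ x[target_row]) / (norm * norm[target_row])

def ensemble_scores(X, M, target, model, n_sub, m_sub, rounds, seed=0):
    """Average per-variable scores over random (variable, observation) subsets."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    others = np.delete(np.arange(n), target)
    total = np.zeros(n)
    count = np.zeros(n)
    for _ in range(rounds):
        # Variable subset V always contains the target at position 0.
        picked = rng.choice(others, size=n_sub - 1, replace=False)
        V = np.concatenate(([target], picked))
        # Observation subset O, applied identically to X and M.
        O = rng.choice(m, size=m_sub, replace=False)
        inp = np.stack([X[np.ix_(V, O)], M[np.ix_(V, O)]], axis=-1)
        total[V] += model(inp, 0)
        count[V] += 1
    avg = np.divide(total, count, out=np.zeros(n), where=count > 0)
    avg[target] = 0.0  # the target is not scored as its own cause
    return avg

# Hypothetical usage on synthetic data with no interventions (M is all zeros).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))
M = np.zeros_like(X)
scores = ensemble_scores(X, M, target=0, model=placeholder_model,
                         n_sub=8, m_sub=30, rounds=40)
print(scores.shape)
```

Each model call sees a fixed-size input, so if the number of rounds is chosen proportional to $n / n'$, the total cost grows linearly with the number of variables. That schedule is one plausible reading of the linear-complexity claim and is an assumption of this sketch.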

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3: Algorithm complexity
  • ...and 1 more