Detection of Unobserved Common Causes based on NML Code in Discrete, Mixed, and Continuous Variables

Masatoshi Kobayashi, Kohei Miyaguchi, Shin Matsushima

TL;DR

This work extends CLOUD, a Normalized Maximum Likelihood (NML) code-based causal-discovery method, to discrete, mixed, and continuous data for identifying unobserved common causes under Reichenbach's principle. By modeling four SCM-based relationships (independence, direct causality in either direction, and latent confounding via an unobserved C) and using MDL-based codelengths to compare models, CLOUD avoids assuming a specific form for unobserved confounders. The authors prove consistency and demonstrate through synthetic and real-world experiments that CLOUD delivers high-accuracy causal inference across data types, including the detection of latent confounding where appropriate. This enables robust causal analysis from observational data in diverse domains, eliminating strong, type-specific assumptions about hidden variables.

Abstract

Causal discovery from observational data alone in the presence of unobserved common causes is a crucial but challenging problem. We categorize all possible causal relationships between two random variables into the following four categories and aim to identify which one holds from observed data: two cases in which one variable directly causes the other, a case in which the variables are independent, and a case in which the variables are confounded by latent confounders. Although existing methods have been proposed to tackle this problem, they require the unobserved variables to satisfy assumptions on the form of their equation models. In our previous study (Kobayashi et al., 2022), we proposed CLOUD, the first causal discovery method without such assumptions, for discrete data. Using the Normalized Maximum Likelihood (NML) code, CLOUD selects the model that yields the minimum codelength of the observed data from a set of candidate models. This paper extends CLOUD to discrete, mixed, and continuous data. We not only provide a theoretical analysis showing the consistency of CLOUD in terms of model selection, but also demonstrate through extensive experiments on both synthetic and real-world data that CLOUD is more effective than existing methods in inferring causal relationships.

Paper Structure

This paper contains 42 sections, 6 theorems, 84 equations, 7 figures, 5 tables, 4 algorithms.

Key Result

Proposition 1

For given discrete data $z^n$ and the discrete causal models $M\in\left\{M_{X \mathop{\!\perp\!\!\!\perp\!} Y},M_{X \to Y},M_{X \gets Y},M_{X \gets C \to Y}\right\}$, the codelengths defined above admit explicit expressions, and analogous expressions hold for $\ell_{Y}$ and $\ell_{X|Y}$. Here, $\hat{f}$ and $\hat{g}$ are functions derived through maximum likelihood estimation.
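
To give a concrete feel for an NML codelength on discrete data, the following is a minimal sketch and not the expressions of Proposition 1. It assumes a plain categorical (multinomial) model, computes the NML normalizer with the standard recurrence $C(k+2,n) = C(k+1,n) + \frac{n}{k}\,C(k,n)$, and measures codelengths in nats. The function names `multinomial_regret` and `nml_codelength` are hypothetical, and the final comparison between an independent model and an unrestricted joint model is a simplified two-model stand-in for the four-model selection described in the paper.

```python
import math
from collections import Counter


def _xlog_ratio(h: int, n: int) -> float:
    """Return h * log(h / n), with the convention 0 * log(0) = 0."""
    return h * math.log(h / n) if h > 0 else 0.0


def multinomial_regret(n: int, k: int) -> float:
    """Log of the NML normalizer C(k, n) for a k-category multinomial (nats)."""
    if n == 0 or k == 1:
        return 0.0
    c_prev = 1.0  # C(1, n)
    # C(2, n) by direct summation, computed in log space for stability.
    c_curr = sum(
        math.exp(
            math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
            + _xlog_ratio(h, n) + _xlog_ratio(n - h, n)
        )
        for h in range(n + 1)
    )
    # Recurrence: C(j + 2, n) = C(j + 1, n) + (n / j) * C(j, n).
    for j in range(1, k - 1):
        c_prev, c_curr = c_curr, c_curr + (n / j) * c_prev
    return math.log(c_curr)


def nml_codelength(data, k: int) -> float:
    """NML codelength: negative maximized log-likelihood plus log normalizer."""
    n = len(data)
    neg_max_loglik = -sum(_xlog_ratio(c, n) for c in Counter(data).values())
    return neg_max_loglik + multinomial_regret(n, k)


def compare(x, y, kx: int, ky: int):
    """Codelengths under an independent model and an unrestricted joint model."""
    independent = nml_codelength(x, kx) + nml_codelength(y, ky)
    joint = nml_codelength(list(zip(x, y)), kx * ky)
    return independent, joint


# Illustrative data (made up): y copies x in one case, is unrelated in the other.
dep_x = [0, 1] * 20
dep_y = list(dep_x)
ind_x = [0, 0, 1, 1] * 10
ind_y = [0, 1, 0, 1] * 10

print(compare(dep_x, dep_y, 2, 2))
print(compare(ind_x, ind_y, 2, 2))
```

In this sketch the shorter of the two codelengths marks the preferred model: the joint model wins when `y` copies `x`, while the independent model wins when the two sequences are unrelated, because the joint model pays a larger normalizer for its extra parameters.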

Figures (7)

  • Figure 1: Confusion matrices in the Discrete Case of Experiment 1
  • Figure 2: Confusion matrices in the Mixed Case of Experiment 1
  • Figure 3: Confusion matrices in the Continuous Case of Experiment 1
  • Figure 4: Accuracy vs. decision rate of CLOUD on synthetic data
  • Figure 5: Scatter plots of the Tübingen Cause-Effect Pairs. The horizontal axis represents $X$, while the vertical axis represents $Y$. Each plot corresponds to a dataset pair.
  • ...and 2 more figures

Theorems & Definitions (12)

  • Example 1: The NML Codelength for Discrete Data
  • Example 2: The NML Codelength for Continuous Data
  • Example 3: Additive Noise Model ($\textsf{ANM}$)
  • Example 4: Linear NonGaussian Acyclic Model ($\textsf{LiNGAM}$)
  • Example 5: Linear Mixed causal model ($\textsf{LiM}$)
  • Example 6: LiNGAM with latent confounder ($\textsf{lvLiNGAM}$)
  • Proposition 1: NML-based codelength for discrete data
  • Proposition 2: Codelength in Continuous Case
  • Theorem 1
  • Theorem 2
  • ...and 2 more