Federated Causal Discovery from Heterogeneous Data

Loka Li, Ignavier Ng, Gongxu Luo, Biwei Huang, Guangyi Chen, Tongliang Liu, Bin Gu, Kun Zhang

TL;DR

The paper tackles federated causal discovery when data are decentralized and heterogeneous across clients, a setting in which standard centralized approaches do not apply. It introduces FedCDH, a nonparametric, privacy-preserving framework that uses a surrogate variable (the client index) to model distribution shifts. FedCDH has two components: a federated conditional independence test (FCIT) for skeleton discovery and a federated independent change principle (FICP) for determining causal directions. Both identify causal structure across clients using only summary statistics. Key contributions include the design of FCIT and FICP, their implementation via summary statistics and random features, and empirical validation on synthetic data (linear Gaussian and general functional models) and on real fMRI and stock market data, where FedCDH is reported to outperform baseline methods. The approach targets scalable, privacy-conscious causal discovery in domains such as healthcare and finance, with potential extensions to vertically partitioned data and more efficient privacy-preserving computations.

Abstract

Conventional causal discovery methods rely on centralized data, which is inconsistent with the decentralized nature of data in many real-world situations. This discrepancy has motivated the development of federated causal discovery (FCD) approaches. However, existing FCD methods may be limited by their potentially restrictive assumptions of identifiable functional causal models or homogeneous data distributions, narrowing their applicability in diverse scenarios. In this paper, we propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data. We first utilize a surrogate variable corresponding to the client index to account for the data heterogeneity across different clients. We then develop a federated conditional independence test (FCIT) for causal skeleton discovery and establish a federated independent change principle (FICP) to determine causal directions. These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy. Owing to the nonparametric properties, FCIT and FICP make no assumption about particular functional forms, thereby facilitating the handling of arbitrary causal models. We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method. The code is available at https://github.com/lokali/FedCDH.git.
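
To make the idea of summary statistics as a proxy for raw data concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes that all clients share the same random Fourier feature parameters, and that each client sends only its sample size, feature sums, and an uncentered cross-product matrix. The function names `random_fourier_features`, `client_summary`, and `server_cross_covariance` are hypothetical. Under these assumptions the server recovers exactly the cross-covariance it would have computed on the pooled data.

```python
import numpy as np

def random_fourier_features(x, omega, phase):
    """Map samples x of shape (n, d) to h random Fourier features (Gaussian-kernel approximation)."""
    h = omega.shape[1]
    return np.sqrt(2.0 / h) * np.cos(x @ omega + phase)

def client_summary(x, y, omega_x, phase_x, omega_y, phase_y):
    """Hypothetical client step: share only aggregate statistics, never raw samples."""
    fx = random_fourier_features(x, omega_x, phase_x)
    fy = random_fourier_features(y, omega_y, phase_y)
    return {"n": x.shape[0], "sum_x": fx.sum(axis=0), "sum_y": fy.sum(axis=0), "cross": fx.T @ fy}

def server_cross_covariance(summaries):
    """Hypothetical server step: combine client summaries into a centered cross-covariance."""
    n = sum(s["n"] for s in summaries)
    mean_x = sum(s["sum_x"] for s in summaries) / n
    mean_y = sum(s["sum_y"] for s in summaries) / n
    cross = sum(s["cross"] for s in summaries)
    return cross / n - np.outer(mean_x, mean_y)

rng = np.random.default_rng(0)
d, h = 1, 16
# Feature parameters are assumed to be broadcast to every client (e.g. via a shared seed).
omega_x, omega_y = rng.normal(size=(d, h)), rng.normal(size=(d, h))
phase_x, phase_y = rng.uniform(0, 2 * np.pi, h), rng.uniform(0, 2 * np.pi, h)

# Three clients with different sample sizes and shifted distributions (heterogeneous data).
xs, ys = [], []
for n_k, shift in [(50, 0.0), (80, 1.5), (30, -2.0)]:
    x = rng.normal(loc=shift, size=(n_k, d))
    xs.append(x)
    ys.append(np.tanh(x) + 0.1 * rng.normal(size=(n_k, d)))

summaries = [client_summary(x, y, omega_x, phase_x, omega_y, phase_y) for x, y in zip(xs, ys)]
federated = server_cross_covariance(summaries)

# Reference: the same statistic computed centrally on the pooled raw data.
fx = random_fourier_features(np.vstack(xs), omega_x, phase_x)
fy = random_fourier_features(np.vstack(ys), omega_y, phase_y)
pooled = (fx - fx.mean(axis=0)).T @ (fy - fy.mean(axis=0)) / fx.shape[0]

print(np.allclose(federated, pooled))
```

The design choice illustrated here is that the communicated quantities have a size that depends on the number of random features, not on the number of samples, which is what makes a statistic of this kind usable in a federated test.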

Paper Structure

This paper contains 40 sections, 17 theorems, 56 equations, 11 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

Let $\ddot{X} \triangleq (X,Z), k_{\mathcal{\ddot{X}}}\triangleq k_{\mathcal{X}} k_{\mathcal{Z}}$, and $\mathcal{H_{\ddot{X}}}$ be the RKHS corresponding to $k_{\mathcal{\ddot{X}}}$. Assume that $\mathcal{H_X} \subset L^2_X, \mathcal{H_Y} \subset L^2_Y, \mathcal{H_Z} \subset L^2_Z$. Further assume t

Figures (11)

  • Figure 1: An illustration where the causal models of variables $V_i$ and $V_j$ are changing across domains. (a) the graph with unobserved domain-changing factors $\psi_{\ell}(\mho)$, $\theta_i(\mho)$ and $\theta_j(\mho)$; (b) the simplified graph with the surrogate variable $\mho$.
  • Figure 2: Overall framework of $\operatorname{FedCDH}$. Left: The clients will send their sample sizes and local covariance tensors to the server, for constructing the summary statistics. The federated causal discovery will be implemented on the server. Right Top: Relying on the summary statistics, we propose two submodules: federated conditional independence test and federated independent change principle, for skeleton discovery and direction determination. Right Bottom: An example of FCD with three observed variables is illustrated, where the causal modules related to $V_2$ and $V_3$ are changing.
  • Figure 3: Results of synthetic dataset on linear Gaussian model. By rows, we evaluate varying number of variables $d$, varying number of clients $K$, and varying number of samples $n_k$. By columns, we evaluate Skeleton $F_1$ ($\uparrow$), Skeleton SHD ($\downarrow$), Direction $F_1$ ($\uparrow$) and Direction SHD ($\downarrow$).
  • Figure A1: Given that $X \perp\!\!\!\perp Y | Z$, we could introduce the independence between $R_{\ddot{X}|Z}$ and $R_{Y|Z}$.
  • Figure A2: Results of the synthetic dataset on (a) linear Gaussian model and (b) general functional model. By rows in each subfigure, we evaluate varying number of variables $d$, varying number of clients $K$, and varying number of samples $n_k$. By columns in each subfigure, we evaluate Skeleton Precision ($\uparrow$), Skeleton Recall ($\uparrow$), Direction Precision ($\uparrow$) and Direction Recall ($\uparrow$).
  • ...and 6 more figures

Theorems & Definitions (18)

  • Lemma 1: Characterization of CI with Partial Cross-covariance (Fukumizu et al., 2007)
  • Lemma 2: Independent Change Principle (Huang et al., 2020)
  • Lemma 3: Characterization of Conditional Independence
  • Theorem 4: Federated Conditional Independence Test
  • Theorem 5: Null Distribution Approximation
  • Theorem 6: Federated Independent Change Principle
  • Lemma 7: Estimating Covariance Matrix from Kernel Matrix
  • Theorem 8: Sufficiency of Summary Statistics
  • Lemma 9: Characteristic Kernel (Fukumizu et al., 2007)
  • Lemma 10: Characterization of CI based on Partial Association (Daudin, 1980)
  • ...and 8 more