Federated Causal Discovery from Heterogeneous Data
Loka Li, Ignavier Ng, Gongxu Luo, Biwei Huang, Guangyi Chen, Tongliang Liu, Bin Gu, Kun Zhang
TL;DR
The paper addresses federated causal discovery when data are heterogeneous and decentralized, a setting where methods that assume centralized data cannot be applied directly. It introduces FedCDH, a nonparametric, privacy-preserving framework that uses a surrogate variable corresponding to the client index to model distribution shifts across clients. The framework has two components: a federated conditional independence test (FCIT) for skeleton discovery and a federated independent change principle (FICP) for determining causal directions, both computed from summary statistics rather than raw data. Key contributions include the design of FCIT and FICP, their implementation via summary statistics and random features, and empirical validation on synthetic linear Gaussian and general functional models as well as real fMRI and stock market data, where FedCDH outperforms baseline methods. The approach supports privacy-conscious causal discovery in domains such as healthcare and finance, with potential extensions to vertically partitioned data and more efficient privacy-preserving computations.
Abstract
Conventional causal discovery methods rely on centralized data, which is inconsistent with the decentralized nature of data in many real-world situations. This discrepancy has motivated the development of federated causal discovery (FCD) approaches. However, existing FCD methods may be limited by their potentially restrictive assumptions of identifiable functional causal models or homogeneous data distributions, narrowing their applicability in diverse scenarios. In this paper, we propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data. We first utilize a surrogate variable corresponding to the client index to account for the data heterogeneity across different clients. We then develop a federated conditional independence test (FCIT) for causal skeleton discovery and establish a federated independent change principle (FICP) to determine causal directions. These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy. Owing to the nonparametric properties, FCIT and FICP make no assumption about particular functional forms, thereby facilitating the handling of arbitrary causal models. We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method. The code is available at https://github.com/lokali/FedCDH.git.
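To illustrate the idea of using summary statistics as a proxy for raw data, the following is a minimal sketch and not the authors' implementation. It assumes that all clients share the same random Fourier feature parameters (for example through a common seed), and that each client shares only its sample count, feature sums, and feature cross-product sum. The function names (random_fourier_features, local_summary, aggregate_cross_covariance) and the toy data are hypothetical. The sketch shows only that a server can recover the pooled cross-covariance of the random features from these statistics; it does not reproduce the FCIT or FICP procedures themselves.

```python
import numpy as np


def random_fourier_features(x, omega, phase):
    """Map samples x of shape (n, d) to m random Fourier features, shape (n, m)."""
    m = omega.shape[1]
    return np.sqrt(2.0 / m) * np.cos(x @ omega + phase)


def local_summary(fx, fy):
    """Statistics one client shares: count, feature sums, and cross-product sum."""
    return fx.shape[0], fx.sum(axis=0), fy.sum(axis=0), fx.T @ fy


def aggregate_cross_covariance(summaries):
    """Server side: rebuild the pooled (biased) cross-covariance from client summaries."""
    n = sum(s[0] for s in summaries)
    mean_x = sum(s[1] for s in summaries) / n
    mean_y = sum(s[2] for s in summaries) / n
    sum_xy = sum(s[3] for s in summaries)
    return sum_xy / n - np.outer(mean_x, mean_y)


rng = np.random.default_rng(0)
m = 5  # number of random features (illustrative choice)

# Feature parameters shared by all clients (assumption: agreed on in advance).
omega_x, phase_x = rng.normal(size=(1, m)), rng.uniform(0, 2 * np.pi, size=m)
omega_y, phase_y = rng.normal(size=(1, m)), rng.uniform(0, 2 * np.pi, size=m)

# Toy heterogeneous data: each client has its own mean shift and mechanism strength.
clients = []
for k, n_k in enumerate([50, 80, 30]):
    x = rng.normal(loc=k, size=(n_k, 1))
    y = (k + 1) * x + rng.normal(size=(n_k, 1))
    clients.append((x, y))

# Each client computes its summary locally; raw samples never leave the client.
summaries = [
    local_summary(
        random_fourier_features(x, omega_x, phase_x),
        random_fourier_features(y, omega_y, phase_y),
    )
    for x, y in clients
]
fed_cov = aggregate_cross_covariance(summaries)

# Centralized reference, computed here only to check the aggregation.
fx = random_fourier_features(np.vstack([x for x, _ in clients]), omega_x, phase_x)
fy = random_fourier_features(np.vstack([y for _, y in clients]), omega_y, phase_y)
central_cov = (fx - fx.mean(axis=0)).T @ (fy - fy.mean(axis=0)) / fx.shape[0]

assert np.allclose(fed_cov, central_cov)
```

Because the pooled cross-covariance decomposes into sums over clients, the aggregated result matches the centralized one up to floating-point error, even though the clients' distributions differ. Whether sharing such statistics is sufficiently private in a given deployment is a separate question that this sketch does not address.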
