Causal DAG Summarization (Full Version)
Anna Zeng, Michael Cafarella, Batya Kenig, Markos Markakis, Brit Youngmann, Babak Salimi
TL;DR
The paper addresses reliable causal inference on high-dimensional data by introducing a framework for summarizing causal DAGs. It formalizes the problem of producing a concise summary DAG through node contractions, proves that the problem is NP-hard, and presents CaGreS, a scalable greedy algorithm that limits information loss, measured by counting added edges in a canonical causal DAG. It also develops s-separation, a criterion that conservatively identifies the conditional independence (CI) statements holding across all compatible DAGs, and proves that do-calculus remains sound and complete on summary DAGs, so causal inference can be run directly on summaries. Empirical results on six real datasets show that CaGreS outperforms baselines in preserving causal information, improves robustness to misspecification, and delivers inference-ready, interpretable summaries with practical runtime. Overall, the work offers a principled, scalable way to summarize complex causal structures without sacrificing inferential validity.
Abstract
Causal inference helps researchers discover cause-and-effect relationships, leading to scientific insights. Accurate causal estimation requires identifying confounding variables to avoid false discoveries. Pearl's causal model uses causal DAGs to identify confounding variables, but incorrect DAGs can lead to unreliable causal conclusions, and for high-dimensional data the causal DAGs are often too complex for humans to verify. Graph summarization is a logical next step, but current methods for general-purpose graph summarization are inadequate for causal DAG summarization. This paper addresses these challenges by proposing a causal graph summarization objective that balances simplifying the graph for better understanding against retaining the causal information essential for reliable inference. We develop an efficient greedy algorithm and show that summary causal DAGs can be used directly for inference and are more robust to misspecification of assumptions. Experimenting with six real-life datasets, we compare our algorithm to three existing solutions, showing its effectiveness in handling high-dimensional data and its ability to generate summary DAGs that support both reliable causal inference and robustness against misspecification.
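To make the idea of summarizing a DAG by node contraction concrete, here is a minimal Python sketch. It is not the paper's CaGreS algorithm and does not implement its information-loss objective. It only shows the basic operation of merging two nodes into one cluster node and rejecting merges that would introduce a directed cycle. The graph representation (a dict mapping each node to its set of children), the function names `is_acyclic` and `contract`, and the `"u+v"` naming of merged nodes are all assumptions made for illustration.

```python
def is_acyclic(dag):
    """Kahn's algorithm: return True if the directed graph has no cycle."""
    indegree = {node: 0 for node in dag}
    for children in dag.values():
        for child in children:
            indegree[child] += 1
    ready = [node for node, deg in indegree.items() if deg == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for child in dag[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # Every node is visited exactly when no cycle blocks the ordering.
    return visited == len(dag)


def contract(dag, u, v):
    """Merge nodes u and v into a single cluster node (hypothetical helper).

    Returns the contracted graph, or None if the merge would create a
    directed cycle and therefore not yield a valid DAG.
    """
    merged = f"{u}+{v}"

    def rename(node):
        return merged if node in (u, v) else node

    contracted = {}
    for node, children in dag.items():
        contracted.setdefault(rename(node), set()).update(
            rename(child) for child in children
        )
    # A direct edge between u and v becomes a self-loop; drop it.
    contracted[merged].discard(merged)
    return contracted if is_acyclic(contracted) else None


# Diamond-shaped example: A -> B -> D and A -> C -> D.
dag = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}, "D": set()}
summary = contract(dag, "B", "C")   # valid: B and C collapse into one node
rejected = contract(dag, "A", "D")  # invalid: A+D -> B -> A+D is a cycle
```

In this example, merging `B` and `C` gives a three-node chain, while merging `A` and `D` is rejected because the paths through `B` and `C` would loop back into the merged node. A greedy summarizer in this spirit would repeatedly choose among the valid contractions using some cost function; the choice of that cost is where the paper's contribution lies and is left out here.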
