CIDER: Counterfactual-Invariant Diffusion-based GNN Explainer for Causal Subgraph Inference

Qibin Zhang, Chengshang Lyu, Lingxi Chen, Qiqi Jin, Luonan Chen

TL;DR

CIDER tackles the problem of causal subgraph inference from measured graph data by distinguishing edges that causally drive labels from spurious ones. It introduces a counterfactual-invariant diffusion framework that jointly learns distributions over causal and spurious subgraphs using a two-channel VGAE and a diffusion process, enabling interventional causality analysis and robust causal strength estimation. The approach is validated theoretically and empirically on synthetic benchmarks and real-world biological datasets, including COVID-19 scRNA-seq and TCGA-LAML, demonstrating strong causal explanations, substantial network sparsification with minimal performance loss, and biologically meaningful insights. As a model- and task-agnostic method, CIDER offers a generalizable tool for interventional causal inference in graphs and has potential to advance explainability in biological network analysis and beyond.

Abstract

Inferring causal links or subgraphs corresponding to a specific phenotype or label based solely on measured data is an important yet challenging task, and it differs from inferring causal nodes. While Graph Neural Network (GNN) explainers have shown potential in subgraph identification, existing GNN-based methods often offer associative rather than causal insights. This lack of transparency and explainability hinders our understanding of their results and of the underlying mechanisms. To address this issue, we propose a novel method for causal link/subgraph inference, called CIDER (Counterfactual-Invariant Diffusion-based GNN ExplaineR), which combines counterfactual and diffusion components. In other words, it is a model-agnostic and task-agnostic framework for generating causal explanations based on a counterfactual-invariant diffusion process, which provides not only causal subgraphs, owing to the counterfactual component, but also reliable causal links, owing to the diffusion process. Specifically, CIDER is first formulated as an inference task that generatively provides two distributions, one over the causal subgraph and the other over the spurious subgraph. Then, to enhance reliability, we further model the CIDER framework as a diffusion process. Thus, using the causal subgraph distribution, we can explicitly quantify the contribution of each subgraph to a phenotype/label in a counterfactual manner, representing each subgraph's causal strength. From a causality perspective, CIDER is an interventional causal method, different from traditional association studies or observational causal approaches, and can also reduce the effects of unobserved confounders. We evaluate CIDER on both synthetic and real-world datasets, all of which demonstrate the superiority of CIDER over state-of-the-art methods.
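
To make the idea of counterfactually quantifying a subgraph's contribution concrete, here is a minimal sketch. It is not the CIDER implementation. `toy_model`, its edge weights, and the helper names `causal_strength` and `is_invariant` are all hypothetical, and the toy model merely stands in for a trained GNN. The sketch measures how much the predicted label probability drops when a subgraph is removed (an intervention on the edges), and checks that removing the spurious edges leaves the prediction unchanged.

```python
import math


def toy_model(edges):
    """Hypothetical stand-in for a trained GNN: returns P(label = 1 | edges)."""
    # Assumption for illustration only: just edges (0, 1) and (1, 2) drive the label.
    weights = {(0, 1): 2.0, (1, 2): 1.5}
    score = sum(weights.get(e, 0.0) for e in edges) - 1.0
    return 1.0 / (1.0 + math.exp(-score))


def causal_strength(model, edges, subgraph):
    """Drop in predicted probability when `subgraph` is removed from the graph."""
    edges = set(edges)
    return model(edges) - model(edges - set(subgraph))


def is_invariant(model, edges, spurious, tol=1e-9):
    """True if intervening on the spurious edges leaves the prediction unchanged."""
    return abs(causal_strength(model, edges, spurious)) <= tol


graph = {(0, 1), (1, 2), (2, 3), (3, 4)}
causal = {(0, 1), (1, 2)}      # edges assumed to drive the label
spurious = {(2, 3), (3, 4)}    # edges assumed to be irrelevant to the label

strength_causal = causal_strength(toy_model, graph, causal)
strength_spurious = causal_strength(toy_model, graph, spurious)
print(f"causal subgraph strength:   {strength_causal:.3f}")
print(f"spurious subgraph strength: {strength_spurious:.3f}")
print("invariant to spurious edges:", is_invariant(toy_model, graph, spurious))
```

In this toy setting the split into causal and spurious edges is given by hand. In the method described above, that split is what the learned causal and spurious subgraph distributions are meant to provide.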

Paper Structure (29 sections, 10 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Illustration and framework of CIDER. (a) Conceptual illustration of CIDER; (b) Framework of CIDER. Given a graph with a label $Y$, CIDER outputs the causal subgraph, which causally affects the label, and the spurious subgraph, which does not affect the label.
  • Figure 2: COVID-19 scRNA-seq analyses by CIDER. (a) Three subgraphs generated by STRING illustrate the division of 30 key genes into three distinct clusters, revealing potential unique functions and interactions within the molecular network related to COVID-19. (b) KEGG pathways related to ribosome, apoptosis, tuberculosis, and COVID-19 were enriched, suggesting a direct connection between the key genes and the disease mechanisms. (c and d) Most genes were enriched in immunity-related keywords, supporting the effectiveness of the CIDER method in identifying COVID-19-related genes. This further emphasizes the role of the key genes in the inflammatory conditions of COVID-19 and their potential prognostic value for severe cases.
  • Figure 3: TCGA-LAML analyses by CIDER. (a) Network subgraphs generated by STRING illustrate the division of 51 key genes into two distinct clusters, revealing potential unique functions and interactions within the molecular network related to acute myeloid leukemia (red points) and ribosome (blue points). (b) KEGG pathways related to two types of myeloid leukemia, pathways in cancer, and cancer-related signaling pathways were enriched, suggesting a direct connection between the key genes and the leukemia mechanisms. (c) Gene ontology biological process enrichment analysis for the key genes identified by CIDER. (d) Gene ontology cellular component enrichment analysis for the key genes identified by CIDER.
  • Figure 4: Causal structure of CIDER. (a) depicts the directed acyclic graph (DAG) of our method, where $E^c$ has a direct causal effect on $Y$, and $E^s$ represents spurious factors. (b) illustrates a simple 0-1 decision region of $E^c$ and $E^s$ in a 2D space, where the star and triangle indicate different graph labels, and their solid and dotted borders denote the original sample and the counterfactual sample in $E^s$, respectively. (c) is the illustration of the diffusion process for the separation of causal factors. The red subgraph has a direct causal effect on the given label.
  • Figure 5: Workflow of CIDER for inferring the causal genes and cell types associated with a disease state from scRNA-seq data.