Counterfactual Data Augmentation with Denoising Diffusion for Graph Anomaly Detection
Chunjing Xiao, Shikang Pang, Xovee Xu, Xuan Li, Goce Trajcevski, Fan Zhou
TL;DR
CAGAD introduces an unsupervised counterfactual data augmentation framework that improves graph anomaly detection by targeting the neighborhood aggregation of heterophilic nodes. It combines a graph pointer neural network, which identifies heterophilic nodes, with a DDPM-based anomaly generator, which translates selected neighbors into anomalous ones; a counterfactual GNN then aggregates these translated neighbors to produce counterfactual node representations. Across four datasets, the approach improves on strong baselines by an average of 2.35% in F1, 2.53% in AUC-ROC, and 2.79% in AUC-PR, and it remains applicable to test data without labeled anomalies. The work connects to GNNs and causal representation learning by treating neighborhood manipulation as an intervention that promotes invariance and discriminability of anomalous signals. Overall, CAGAD advances graph anomaly detection by leveraging unsupervised counterfactual augmentation to mitigate over-smoothing and class-imbalance effects.
Abstract
A key function of Graph Neural Networks (GNNs) is to enhance node representations by aggregating information from each node's neighborhood. In anomaly detection, however, the representations of abnormal nodes tend to be averaged out by their normal neighbors, making the learned anomaly representations less distinguishable. To tackle this issue, we propose CAGAD, an unsupervised Counterfactual data Augmentation method for Graph Anomaly Detection, which introduces a graph pointer neural network as a heterophilic node detector to identify potential anomalies whose neighborhoods are dominated by normal nodes. For each identified potential anomaly, we design a graph-specific diffusion model to translate a subset of its neighbors, which are probably normal, into anomalous ones. Finally, we include these translated neighbors in GNN neighborhood aggregation to produce counterfactual representations of anomalies. By aggregating the translated anomalous neighbors, the counterfactual representations become more distinguishable, which further improves detection performance. Experimental results on four datasets demonstrate that CAGAD significantly outperforms strong baselines, with average improvements of 2.35% in F1, 2.53% in AUC-ROC, and 2.79% in AUC-PR.
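The averaging effect described in the abstract can be illustrated with a toy example. The sketch below is not the CAGAD implementation: it assumes a plain mean aggregator, uses made-up 2-D features, and replaces the diffusion-based translation with hand-picked anomalous stand-ins. All names (`mean_aggregate`, `translated`, and so on) are hypothetical. It only shows why swapping some normal neighbors for anomalous ones keeps an anomaly's aggregated representation farther from the normal cluster.

```python
import math


def mean_aggregate(center, neighbors):
    """One mean-aggregation step over the center node and its neighbors.

    Assumption: a simple mean aggregator; the GNN used in the paper may differ.
    """
    rows = [center] + neighbors
    return [sum(row[i] for row in rows) / len(rows) for i in range(len(center))]


# Toy 2-D features (made-up values): normal nodes cluster near the origin.
normal_center = [0.0, 0.0]
anomaly = [4.0, 4.0]
normal_neighbors = [[0.1, -0.1], [-0.2, 0.0], [0.0, 0.2], [0.1, 0.1]]

# Factual aggregation: every neighbor is normal, so the anomaly is pulled
# toward the normal cluster.
factual = mean_aggregate(anomaly, normal_neighbors)

# Counterfactual aggregation: two neighbors are replaced by anomalous-looking
# stand-ins. In CAGAD these would come from the diffusion-based translator;
# here they are hand-picked for illustration only.
translated = [[3.8, 4.1], [4.2, 3.9]]
counterfactual = mean_aggregate(anomaly, translated + normal_neighbors[2:])

# Distance of each aggregated representation from the normal cluster center.
factual_gap = math.dist(factual, normal_center)
counterfactual_gap = math.dist(counterfactual, normal_center)

print(f"factual gap:        {factual_gap:.2f}")
print(f"counterfactual gap: {counterfactual_gap:.2f}")
```

In this toy setting the counterfactual representation stays noticeably farther from the normal cluster than the factual one, which is the intuition behind aggregating translated neighbors.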
