KnowGraph: Knowledge-Enabled Anomaly Detection via Logical Reasoning on Graph Data

Andy Zhou, Xiaojun Xu, Ramesh Raghunathan, Alok Lal, Xinze Guan, Bin Yu, Bo Li

TL;DR

The proposed KnowGraph, which integrates domain knowledge with data-driven learning for enhanced graph-based anomaly detection, consistently outperforms state-of-the-art baselines in both transductive and inductive settings, achieving substantial gains in average precision when generalizing to completely unseen test graphs.

Abstract

Graph-based anomaly detection is pivotal in diverse security applications, such as fraud detection in transaction networks and intrusion detection for network traffic. Standard approaches, including Graph Neural Networks (GNNs), often struggle to generalize across shifting data distributions. Meanwhile, real-world domain knowledge is more stable and is already a common component of practical detection strategies. To explicitly integrate such knowledge into data-driven models such as GCNs, we propose KnowGraph, which combines domain knowledge with data-driven learning for graph-based anomaly detection. KnowGraph comprises two principal components: (1) a statistical learning component that uses a main model for the overall detection task, augmented by multiple specialized knowledge models that predict domain-specific semantic entities; (2) a reasoning component that uses probabilistic graphical models to perform logical inference over the model outputs, encoding domain knowledge as weighted first-order logic formulas. Extensive experiments on large-scale real-world datasets show that KnowGraph consistently outperforms state-of-the-art baselines in both transductive and inductive settings, achieving substantial gains in average precision when generalizing to completely unseen test graphs. Further ablation studies demonstrate the effectiveness of the proposed reasoning component in improving detection performance, especially under extreme class imbalance. These results highlight the potential of integrating domain knowledge into data-driven models for high-stakes, graph-based security applications.
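
To make the reasoning component more concrete, the following is a minimal sketch of how weighted logic formulas can adjust a main model's prediction using knowledge-model outputs. It is an illustration under stated assumptions, not the paper's implementation: the function name `anomaly_marginal`, the single rule "all knowledge predicates hold implies anomaly", the use of each model's confidence as a unary log-odds factor, and exact inference by enumerating all assignments are all choices made here for clarity. Enumeration only works for a handful of variables.

```python
import itertools
import math


def logit(p):
    """Log-odds of a probability; p is assumed to lie strictly in (0, 1)."""
    return math.log(p / (1.0 - p))


def anomaly_marginal(p_main, p_knowledge, rule_weight):
    """Marginal P(anomaly) under a small log-linear (Markov-logic-style) model.

    Variables: one binary `anomaly` variable plus one binary variable per
    knowledge predicate. Factors (all hypothetical modelling choices):
      * a unary factor per model output with weight logit(confidence);
      * one rule factor "(all knowledge predicates true) => anomaly"
        that adds `rule_weight` whenever the implication is satisfied.
    """
    n = len(p_knowledge)
    total_mass = 0.0
    anomalous_mass = 0.0
    for world in itertools.product([0, 1], repeat=n + 1):
        anomaly, preds = world[0], world[1:]
        score = anomaly * logit(p_main)
        score += sum(k * logit(p) for k, p in zip(preds, p_knowledge))
        # An implication is violated only when the body holds and the head fails.
        rule_holds = (not all(preds)) or bool(anomaly)
        score += rule_weight * rule_holds
        mass = math.exp(score)
        total_mass += mass
        anomalous_mass += mass * anomaly
    return anomalous_mass / total_mass


# The main model is unsure (0.3); two knowledge models are confident (0.9 each).
without_rule = anomaly_marginal(0.3, [0.9, 0.9], rule_weight=0.0)
with_rule = anomaly_marginal(0.3, [0.9, 0.9], rule_weight=3.0)
print(without_rule, with_rule)
```

With a rule weight of zero the factors are independent, so the marginal reduces to the main model's own confidence. A positive weight raises the anomaly probability when the knowledge predicates are likely to hold, which is the sense in which the rule corrects the main model.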

Paper Structure

This paper contains 34 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Examples of anomaly detection on graph-structured data with GNN models. Data-driven learning approaches have been successful on node-level, edge-level, and subgraph-level tasks but tend to treat the levels of the graph separately, focusing on one level at a time.
  • Figure 2: An overview of the learning and reasoning components of KnowGraph. The learning component consists of a main GNN model trained on the overall task and multiple knowledge GNN models trained on separate objectives, such as predicting relevant sub-attributes. The reasoning component performs logical reasoning over the outputs of each model, organized according to domain knowledge rules. These rules are assigned weights, modeled by a learned scalable reasoning model parameterized by $\theta$, which explicitly ensures that the final predictions comply with the domain knowledge rules, improving reliability.
  • Figure 3: An illustration of the graph from real eBay marketplace data containing the core entities of a transaction, which are represented as nodes in a knowledge graph consisting of nodes of transactions (TXN), users (USR), and items (ITM). Examples of collusive and benign transactions are shown. Example node attributes are listed below the entity, such as gross merchandise value (gmv), account age (acc$\_$age), and total feedback (tot$\_$fdbk). Domain knowledge suggests that collusive transactions typically involve discrepancies between billing and delivery zip codes and are associated with users who have minimal feedback and newer accounts. This understanding has led to the formulation of rules such as "$[\texttt{feedback amt} < a] \land [(\texttt{seller\_age} < b \lor \texttt{buyer\_age} < c)] \Rightarrow \texttt{collusion}$", which help refine the main model's predictions based on these indicators.
  • Figure 4: Detection AUC of baselines and KnowGraph given different time shifts on the LANL dataset. In the challenging inductive setting (yellow), the baseline performance drops significantly while KnowGraph still maintains high detection performance.
  • Figure 5: Model prediction logits on the different time shifts for (top) Euler, (middle) EncG, and (bottom) KnowGraph. Low prediction logits indicate that the access is considered malicious. We can observe that all approaches perform well for the transductive setting; for the inductive setting, it is easier for KnowGraph to separate the malicious accesses by setting a threshold at around -1.5. However, it is difficult to classify benign and malicious accesses based on the prediction logits of Euler and EncG.
  • ...and 2 more figures
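
The rule quoted in the Figure 3 caption can be read as a simple Boolean predicate over transaction attributes. The sketch below shows that reading only. The `Transaction` class, the field units, and the threshold values passed for `a`, `b`, and `c` are hypothetical, since the caption leaves the thresholds symbolic. In KnowGraph such a rule is weighted rather than applied as a hard filter, so this predicate corresponds to the rule body, not to the final prediction.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    """Hypothetical container for the attributes named in the rule."""
    feedback_amt: int   # total feedback received (unit assumed: count)
    seller_age: float   # seller account age (unit assumed: days)
    buyer_age: float    # buyer account age (unit assumed: days)


def collusion_rule_fires(txn, a, b, c):
    """Body of: [feedback amt < a] AND [(seller_age < b) OR (buyer_age < c)]."""
    return txn.feedback_amt < a and (txn.seller_age < b or txn.buyer_age < c)


# Illustrative thresholds only; the caption does not give values for a, b, c.
new_account_txn = Transaction(feedback_amt=2, seller_age=5.0, buyer_age=400.0)
established_txn = Transaction(feedback_amt=850, seller_age=2000.0, buyer_age=1500.0)

print(collusion_rule_fires(new_account_txn, a=10, b=30, c=30))
print(collusion_rule_fires(established_txn, a=10, b=30, c=30))
```

Keeping the thresholds as arguments mirrors the symbolic form of the rule and lets one predicate be evaluated under different threshold choices.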