Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark

Yili Wang; Yixin Liu; Xu Shen; Chenyu Li; Kaize Ding; Rui Miao; Ying Wang; Shirui Pan; Xin Wang

TL;DR

This work addresses the fragmentation between unsupervised graph-level anomaly detection (GLAD) and graph-level out-of-distribution detection (GLOD) by introducing UB-GOLD, a unified benchmark for generalized graph-level OOD detection. It unifies GLAD and GLOD under a common objective and evaluates 18 methods across 35 datasets spanning intrinsic anomalies, class-based anomalies, and inter/intra-dataset shifts, using AUROC, AUPRC, and FPR95. The study provides multi-dimensional analyses of effectiveness, OOD sensitivity, robustness, and efficiency, revealing that end-to-end methods often outperform two-step approaches and that near-OOD detection remains challenging. The open-source UB-GOLD codebase and the derived insights lay the groundwork for broader, reproducible evaluation and guide future research toward universal, robust learning strategies across diverse graph domains.
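The three metrics named above can be illustrated with a minimal, dependency-free sketch. It is not taken from the UB-GOLD codebase: the function names are hypothetical, and it assumes that OOD or anomalous graphs are the positive class (label 1), that a higher score means "more anomalous", and that scores are not tied.

```python
import math
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random positive (OOD) sample outscores a random negative (ID) one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Pairwise comparison; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision: mean precision at each rank where a positive appears.

    Assumption: no tied scores (ties would need threshold grouping).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


def fpr_at_95_tpr(scores: Sequence[float], labels: Sequence[int]) -> float:
    """False positive rate at the strictest threshold that still flags 95% of positives."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    k = math.ceil(0.95 * len(pos))  # number of positives that must be flagged
    threshold = pos[k - 1]
    return sum(n >= threshold for n in neg) / len(neg)


if __name__ == "__main__":
    # Toy scores for eight graphs; labels mark which ones are OOD (1) or ID (0).
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    labels = [1, 1, 0, 1, 0, 1, 0, 0]
    print("AUROC:", auroc(scores, labels))
    print("AUPRC:", auprc(scores, labels))
    print("FPR95:", fpr_at_95_tpr(scores, labels))
```

Higher AUROC and AUPRC indicate better detection, while lower FPR95 is better. In practice a library implementation (for example from scikit-learn) would normally be preferred over this pairwise version, which is quadratic in the number of samples.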

Abstract

To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Although these two lines of research share the same objective, they have been studied independently in the community because of distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one to the other. To bridge this gap, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaLy Detection (UB-GOLD), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 18 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, OOD sensitivity spectrum, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase (https://github.com/UB-GOLD/UB-GOLD) of UB-GOLD to foster reproducible research and outline potential directions for future investigations based on our insights.

Paper Structure (25 sections, 6 equations, 12 figures, 8 tables)


Introduction
Related Work and Problem Definition
Benchmark Design
Benchmark Datasets
Benchmark Algorithms
Evaluation Metrics and Implementation Details
Experimental Results & Discussion
Performance Comparison (RQ1)
OOD sensitivity spectrum on Near-OOD and Far-OOD (RQ2)
Robustness under Training Set Perturbation (RQ3)
Efficiency Analysis (RQ4)
Conclusion and Future Directions
Why UB-GOLD focuses on unsupervised scenarios
Detailed Description of datasets in UB-GOLD
Detailed Description of Algorithms in UB-GOLD
...and 10 more sections

Figures (12)

Figure 1: Diagram of generalized OOD detection scenarios supported by UB-GOLD.
Figure 2: An overview of UB-GOLD.
Figure 3: Comparison of the detection performance on 35 datasets in terms of three metrics.
Figure 4: Near and Far OOD performance comparison in terms of AUROC.
Figure 5: Performance of models under different perturbation levels in terms of AUROC.
...and 7 more figures
