Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark

Yili Wang, Yixin Liu, Xu Shen, Chenyu Li, Kaize Ding, Rui Miao, Ying Wang, Shirui Pan, Xin Wang

TL;DR

This work addresses the fragmentation between unsupervised graph-level anomaly detection (GLAD) and graph-level out-of-distribution detection (GLOD) by introducing UB-GOLD, a unified benchmark for generalized graph-level OOD detection. It unifies GLAD and GLOD under a common objective and evaluates 18 methods across 35 datasets spanning intrinsic anomalies, class-based anomalies, and inter/intra-dataset shifts, using AUROC, AUPRC, and FPR95. The study provides multi-dimensional analyses of effectiveness, OOD sensitivity, robustness, and efficiency, revealing that end-to-end methods often outperform two-step approaches and that near-OOD detection remains challenging. The open-source UB-GOLD codebase and the derived insights lay the groundwork for broader, reproducible evaluation and guide future research toward universal, robust learning strategies across diverse graph domains.
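The three metrics named above can be illustrated with a minimal, dependency-free sketch. It is not taken from the UB-GOLD codebase. The function names, the convention that label 1 marks an OOD/anomalous graph and a higher score means "more anomalous", and the toy scores are all assumptions made for illustration. The average-precision function is a simplified stand-in for AUPRC that assumes no tied scores.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random OOD graph outscores a random ID graph (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@rank taken at each OOD graph (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def fpr_at_95_tpr(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of ID graphs flagged at the highest threshold that catches >= 95% of OOD graphs."""
    pos = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    neg = [s for y, s in zip(labels, scores) if y == 0]
    k = (95 * len(pos) + 99) // 100      # ceil(0.95 * n_pos) in integer arithmetic
    threshold = pos[k - 1]               # score of the k-th highest-scoring OOD graph
    return sum(n >= threshold for n in neg) / len(neg)


# Hypothetical scores for three OOD graphs (label 1) and four ID graphs (label 0).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.05]
print(auroc(labels, scores), average_precision(labels, scores), fpr_at_95_tpr(labels, scores))
```

Higher is better for AUROC and AUPRC, while lower is better for FPR95, so the three numbers should be read together rather than averaged.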

Abstract

To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though those two lines of research indeed share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one to the other. To bridge the gap, in this work, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaLy Detection (UB-GOLD), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 18 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, OOD sensitivity spectrum, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase (https://github.com/UB-GOLD/UB-GOLD) of UB-GOLD to foster reproducible research and outline potential directions for future investigations based on our insights.

Paper Structure (25 sections, 6 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Diagram of generalized OOD detection scenarios supported by UB-GOLD.
  • Figure 2: An overview of UB-GOLD.
  • Figure 3: Comparison of the detection performance on 35 datasets in terms of three metrics.
  • Figure 4: Near and Far OOD performance comparison in terms of AUROC.
  • Figure 5: Performance of models under different perturbation levels in terms of AUROC.
  • ...and 7 more figures