GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation

Federico Bello, Gonzalo Chiarlone, Marcelo Fiori, Gastón García González, Federico Larroca

TL;DR

This work presents an open-source framework for TSAD using GNNs, designed to support reproducible experimentation across datasets, graph structures, and evaluation strategies. The framework facilitates systematic comparisons between TSAD models and enables in-depth analysis of performance and interpretability.

Abstract

There is growing interest in applying graph-based methods to Time Series Anomaly Detection (TSAD), particularly Graph Neural Networks (GNNs), as they naturally model dependencies among multivariate signals. GNNs are typically used as backbones in score-based TSAD pipelines, where anomalies are identified through reconstruction or prediction errors followed by thresholding. However, despite promising results, the field still lacks standardized frameworks for evaluation and suffers from persistent issues with metric design and interpretation. We thus present an open-source framework for TSAD using GNNs, designed to support reproducible experimentation across datasets, graph structures, and evaluation strategies. Built with flexibility and extensibility in mind, the framework facilitates systematic comparisons between TSAD models and enables in-depth analysis of performance and interpretability. Using this tool, we evaluate several GNN-based architectures alongside baseline models across two real-world datasets with contrasting structural characteristics. Our results show that GNNs not only improve detection performance but also offer significant gains in interpretability, an especially valuable feature for practical diagnosis. We also find that attention-based GNNs offer robustness when graph structure is uncertain or inferred. In addition, we reflect on common evaluation practices in TSAD, showing how certain metrics and thresholding strategies can obscure meaningful comparisons. Overall, this work contributes both practical tools and critical insights to advance the development and evaluation of graph-based TSAD systems.
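
The score-based pipeline mentioned in the abstract (prediction or reconstruction error followed by thresholding) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper or its framework: the function names, the choice of the maximum absolute error across sensors as the score, the nearest-rank quantile threshold computed on anomaly-free validation data, and the toy values.

```python
from typing import List, Sequence


def anomaly_scores(observed: Sequence[Sequence[float]],
                   predicted: Sequence[Sequence[float]]) -> List[float]:
    """Per-timestep score: largest absolute prediction error across sensors.

    Assumption: max-aggregation is one possible choice, not the paper's.
    """
    return [max(abs(o - p) for o, p in zip(obs_t, pred_t))
            for obs_t, pred_t in zip(observed, predicted)]


def quantile_threshold(scores: Sequence[float], q: float) -> float:
    """Nearest-rank empirical q-quantile of scores from anomaly-free data."""
    ordered = sorted(scores)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def detect(scores: Sequence[float], threshold: float) -> List[int]:
    """Flag a timestep as anomalous (1) when its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]


# Toy data: two sensors, a model that always predicts (1.0, 2.0).
val_observed = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.0]]
val_predicted = [[1.0, 2.0]] * 4
test_observed = [[1.0, 2.0], [1.0, 5.0], [1.05, 2.0]]
test_predicted = [[1.0, 2.0]] * 3

threshold = quantile_threshold(anomaly_scores(val_observed, val_predicted), 0.99)
test_scores = anomaly_scores(test_observed, test_predicted)
labels = detect(test_scores, threshold)
print(labels)
```

In this sketch the backbone (a GNN or a baseline) only enters through `predicted`, so the detected labels depend as much on the scoring and thresholding choices as on the model itself.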

Paper Structure (9 sections, 1 equation, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Example of point-wise evaluation limitations. Although only one long anomaly is detected, point-wise metrics report a high Recall (0.8) and perfect Precision (1.0). This gives the false impression of good performance, despite the fact that most anomaly ranges in the dataset remain undetected.
  • Figure 2: Range-based recall example. R1 and R3 get a high existence reward, R2 none. R1 obtains a high size score, R3 a low one, and R2 none. Cardinality is high for R3 but lower for R1, since the anomaly is detected as two separate segments instead of one. No position reward is considered here.
  • Figure 3: Example of how the configuration of range-based metrics can misrepresent performance. The model produces an overall poor prediction: the long anomaly is detected through multiple fragmented predictions, and several extended false positives occur across the timeline. However, under certain range-based metric configurations, such as existence-only recall and precision without a cardinality penalty, the evaluation yields high scores, masking the model's true shortcomings.
  • Figure 4: Anomaly score distributions on the SWaT test set for all models (in logarithmic scale). GRU and GDN exhibit better separation between normal (green bars) and anomalous (red) scores, while GCN and MTAD-GAT exhibit significant overlap, complicating threshold selection.
  • Figure 5: Correlation between the different metrics and the validation loss for the different models.
  • ...and 2 more figures
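
The effect described in the Figure 1 caption can be reproduced with a small toy example. The sketch below is illustrative only: the label sequences are invented so that the point-wise numbers match the caption (Recall 0.8, Precision 1.0), and `existence_recall` is a simplified, hypothetical stand-in for a range-based recall that keeps only the existence term and omits the size, cardinality, and position terms mentioned in the Figure 2 caption.

```python
from typing import List, Sequence, Tuple


def pointwise_precision_recall(y_true: Sequence[int],
                               y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall computed over individual timesteps."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def anomaly_ranges(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return [start, end) index pairs for each maximal run of 1s."""
    ranges, start = [], None
    for i, value in enumerate(labels):
        if value == 1 and start is None:
            start = i
        elif value != 1 and start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, len(labels)))
    return ranges


def existence_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true anomaly ranges hit by at least one predicted point."""
    ranges = anomaly_ranges(y_true)
    if not ranges:
        return 0.0
    hits = sum(1 for start, end in ranges if any(y_pred[start:end]))
    return hits / len(ranges)


# Toy labels: one long anomaly (16 points) and four single-point anomalies.
gap = [0] * 5
y_true = gap + [1] * 16 + gap + [1] + gap + [1] + gap + [1] + gap + [1] + gap
# The detector finds only the long anomaly and raises no false alarms.
y_pred = gap + [1] * 16 + [0] * (len(y_true) - 21)

precision, recall = pointwise_precision_recall(y_true, y_pred)
range_recall = existence_recall(y_true, y_pred)
print(precision, recall, range_recall)
```

With these toy labels the point-wise scores look strong, while only one of the five anomaly ranges is actually detected. The reverse risk described in the Figure 3 caption also applies to this simplified measure: because it rewards any overlap, it would not penalize fragmented detections or long false positives.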