DQE: A Semantic-Aware Evaluation Metric for Time Series Anomaly Detection

Yuewei Li, Dalin Zhang, Huan Li, Xinyi Gong, Hongjun Chu, Zhaohui Song

TL;DR

This work revisits the evaluation of time series anomaly detection from the perspective of detection semantics and proposes a novel metric for more comprehensive assessment. In comparisons with ten widely used metrics, the proposed metric provides stable, discriminative, interpretable, and robust evaluation.

Abstract

Time series anomaly detection has achieved remarkable progress in recent years. However, evaluation practices have received comparatively less attention, despite their critical importance. Existing metrics exhibit several limitations: (1) bias toward point-level coverage, (2) insensitivity or inconsistency in near-miss detections, (3) inadequate penalization of false alarms, and (4) inconsistency caused by threshold or threshold-interval selection. These limitations can produce unreliable or counterintuitive results, hindering objective progress. In this work, we revisit the evaluation of time series anomaly detection from the perspective of detection semantics and propose a novel metric for more comprehensive assessment. We first introduce a partitioning strategy grounded in detection semantics, which decomposes the local temporal region of each anomaly into three functionally distinct subregions. Using this partitioning, we evaluate overall detection behavior across events and design finer-grained scoring mechanisms for each subregion, enabling more reliable and interpretable assessment. Through a systematic study of existing metrics, we identify an evaluation bias associated with threshold-interval selection and adopt an approach that aggregates detection qualities across the full threshold spectrum, thereby eliminating evaluation inconsistency. Extensive experiments on synthetic and real-world data demonstrate that our metric provides stable, discriminative, and interpretable evaluation, while achieving robust assessment compared with ten widely used metrics.
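
To make the partitioning idea concrete, here is a minimal sketch of labeling each time step as belonging to a capture region ($A_{\mathtt{cap}}$), a near-miss region ($A_{\mathtt{nm}}$), or a false-alarm region ($A_{\mathtt{fa}}$), and then counting where detections land. It is an illustration under stated assumptions, not the paper's definition: the fixed `buffer` width, the half-open interval convention, and the function names `label_regions` and `count_detections` are all hypothetical, and the actual metric applies finer-grained scoring within each subregion that is not reproduced here.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) index range of one anomaly event


def label_regions(anomalies: List[Interval], n: int, buffer: int) -> List[str]:
    """Label each of n time steps as 'cap', 'nm', or 'fa'.

    `buffer` is a hypothetical near-miss width chosen for illustration;
    the abstract does not state how the subregions are sized.
    """
    # Default: every time step is false-alarm territory.
    labels = ["fa"] * n
    # Mark a band around each anomaly as near-miss first ...
    for start, end in anomalies:
        for t in range(max(0, start - buffer), min(n, end + buffer)):
            labels[t] = "nm"
    # ... then let the anomaly spans themselves overwrite it as capture.
    for start, end in anomalies:
        for t in range(max(0, start), min(n, end)):
            labels[t] = "cap"
    return labels


def count_detections(labels: List[str], detected: List[int]) -> Dict[str, int]:
    """Count detected time steps that fall in each subregion."""
    counts = {"cap": 0, "nm": 0, "fa": 0}
    for t in detected:
        counts[labels[t]] += 1
    return counts


labels = label_regions([(5, 8)], n=12, buffer=2)
counts = count_detections(labels, detected=[0, 4, 6, 7, 9])
```

Separating the three counts, rather than collapsing them into one point-level precision or recall figure, is what allows near-miss detections and distant false alarms to be scored differently.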

Paper Structure

This paper contains 53 sections, 18 equations, 11 figures, and 13 tables.

Figures (11)

  • Figure 1: Output anomaly scores of various algorithms, normalized to the range [0, 1].
  • Figure 2: AUC curves of two algorithms.
  • Figure 3: Locality and integrity of detections. $a_{1-3}$ represent anomaly events; $p_{1-4}$ represent detection events.
  • Figure 4: Partitioning strategy and local detection event groups. $a_{1-3}$ denote anomaly events; $p_{1-10}$ denote detection events. Blue, green, and red regions represent $A_{\mathtt{cap}}$, $A_{\mathtt{nm}}$, and $A_{\mathtt{fa}}$, respectively. $p_5$–$p_6$ belong to $D_{\mathtt{cap}}$; $p_3$–$p_4$ and $p_7$–$p_8$ belong to $D_{\mathtt{nm}}$; $p_1$–$p_2$ and $p_9$–$p_{10}$ belong to $D_{\mathtt{fa}}$.
  • Figure 5: Score and score-difference curves of various metrics under variations in the number of anomalies (a, d), anomaly duration (b, e), and anomaly ratio (c, f).
  • ...and 6 more figures