DQE: A Semantic-Aware Evaluation Metric for Time Series Anomaly Detection

Yuewei Li; Dalin Zhang; Huan Li; Xinyi Gong; Hongjun Chu; Zhaohui Song

TL;DR

This work revisits the evaluation of time series anomaly detection from the perspective of detection semantics and proposes a novel metric, DQE, for more comprehensive assessment. In comparisons against ten widely used metrics, DQE provides stable, discriminative, interpretable, and robust evaluation.

Abstract

Time series anomaly detection has achieved remarkable progress in recent years. However, evaluation practices have received comparatively less attention, despite their critical importance. Existing metrics exhibit several limitations: (1) bias toward point-level coverage, (2) insensitivity or inconsistency in near-miss detections, (3) inadequate penalization of false alarms, and (4) inconsistency caused by threshold or threshold-interval selection. These limitations can produce unreliable or counterintuitive results, hindering objective progress. In this work, we revisit the evaluation of time series anomaly detection from the perspective of detection semantics and propose a novel metric for more comprehensive assessment. We first introduce a partitioning strategy grounded in detection semantics, which decomposes the local temporal region of each anomaly into three functionally distinct subregions. Using this partitioning, we evaluate overall detection behavior across events and design finer-grained scoring mechanisms for each subregion, enabling more reliable and interpretable assessment. Through a systematic study of existing metrics, we identify an evaluation bias associated with threshold-interval selection and adopt an approach that aggregates detection qualities across the full threshold spectrum, thereby eliminating evaluation inconsistency. Extensive experiments on synthetic and real-world data demonstrate that our metric provides stable, discriminative, and interpretable evaluation, while achieving robust assessment compared with ten widely used metrics.
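The idea of aggregating detection quality across the full threshold spectrum, rather than picking one threshold or a threshold interval, can be illustrated with a minimal sketch. It assumes anomaly scores normalized to [0, 1], evenly spaced thresholds, and a plain mean as the aggregation; the function names, the `num_thresholds` parameter, and the point-wise F1 used as the per-threshold quality are placeholders chosen for illustration, not the scoring defined in the paper.

```python
import numpy as np


def point_f1(pred, labels):
    """Placeholder per-threshold quality: point-wise F1 (NOT the paper's scoring)."""
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)


def aggregate_over_thresholds(scores, labels, quality_fn=point_f1, num_thresholds=101):
    """Average a quality score over evenly spaced thresholds covering [0, 1].

    Assumption: `scores` are already normalized to [0, 1]; a point is flagged
    as anomalous at threshold t when its score is >= t.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    qualities = [quality_fn(scores >= t, labels) for t in thresholds]
    return float(np.mean(qualities))


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 0]
    good_scores = [0.0, 0.0, 1.0, 1.0, 0.0]  # scores match the labels
    bad_scores = [1.0, 1.0, 0.0, 0.0, 1.0]   # scores are inverted
    print(aggregate_over_thresholds(good_scores, labels))
    print(aggregate_over_thresholds(bad_scores, labels))
```

Because every threshold contributes to the result, the score does not depend on which threshold or sub-interval an evaluator happens to select, which is the inconsistency the abstract refers to.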

Paper Structure (53 sections, 18 equations, 11 figures, 13 tables)

Introduction
Limitation Analysis
L1: Bias toward Point-Level Coverage
L2: Insensitivity or Inconsistency in Near-Miss Detections
L3: Inadequate Penalization of False Alarms
L4: Inconsistency Caused by Threshold or Threshold-Interval Selection
Proposed Metric
Preliminary Concepts
Time Series Anomaly Detection
Partitioning Strategy and Local Detection Event Group
Partitioning Strategy
Local Detection Event Group
Local Evaluation
Capture of the GT Anomaly Event
Near-Miss Detections Around Anomalies
...and 38 more sections

Figures (11)

Figure 1: Output anomaly scores of various algorithms, normalized to the range [0, 1].
Figure 2: AUC curves of two algorithms.
Figure 3: Locality and integrity of detections. $a_{1-3}$ represent anomaly events; $p_{1-4}$ represent detection events.
Figure 4: Partitioning strategy and local detection event groups. $a_{1-3}$ denote anomaly events; $p_{1-10}$ denote detection events. Blue, green, and red regions represent $A_{\mathtt{cap}}$, $A_{\mathtt{nm}}$, and $A_{\mathtt{fa}}$, respectively. $p_5$–$p_6$ belong to $D_{\mathtt{cap}}$; $p_3$–$p_4$ and $p_7$–$p_8$ belong to $D_{\mathtt{nm}}$; $p_1$–$p_2$ and $p_9$–$p_{10}$ belong to $D_{\mathtt{fa}}$.
Figure 5: Score and score-difference curves of various metrics under variations in the number of anomalies (a, d), anomaly duration (b, e), and anomaly ratio (c, f).
...and 6 more figures
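The grouping described in the Figure 4 caption, where detection events fall into $D_{\mathtt{cap}}$, $D_{\mathtt{nm}}$, or $D_{\mathtt{fa}}$ depending on where they lie relative to an anomaly, can be sketched as follows. This is an illustration under stated assumptions: the `Event` type, the `margin` parameter that sets the width of the near-miss region, and the overlap and distance rules are all hypothetical, since the exact definitions of the three subregions are not given in this excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A contiguous run of time steps; both ends are inclusive."""
    start: int
    end: int


def classify_detection(det, anomaly, margin):
    """Return 'cap', 'nm', or 'fa' for one detection event and one anomaly.

    Assumed rules (hypothetical, for illustration only):
      - 'cap': the detection overlaps the anomaly event;
      - 'nm' : no overlap, but the gap to the anomaly is at most `margin`;
      - 'fa' : the detection lies farther away than `margin`.
    """
    if det.start <= anomaly.end and det.end >= anomaly.start:
        return "cap"
    if det.end < anomaly.start:
        gap = anomaly.start - det.end   # detection ends before the anomaly
    else:
        gap = det.start - anomaly.end   # detection starts after the anomaly
    return "nm" if gap <= margin else "fa"


def group_detections(detections, anomaly, margin):
    """Split detection events into the three groups for a single anomaly."""
    groups = {"cap": [], "nm": [], "fa": []}
    for det in detections:
        groups[classify_detection(det, anomaly, margin)].append(det)
    return groups


if __name__ == "__main__":
    anomaly = Event(100, 120)
    detections = [Event(10, 20), Event(92, 95), Event(105, 110),
                  Event(125, 128), Event(200, 210)]
    for name, events in group_detections(detections, anomaly, margin=10).items():
        print(name, events)
```

Keeping the three groups separate lets each one be scored with its own rule, in line with the finer-grained, per-subregion scoring the abstract describes.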
