Towards Unbiased Evaluation of Time-series Anomaly Detector

Debarpan Bhattacharya; Sumanta Mukherjee; Chandramouli Kamanchi; Vijay Ekambaram; Arindam Jati; Pankaj Dayama

TL;DR

This work tackles biased evaluation in time-series anomaly detection by addressing the mismatch between time points and anomaly events. It introduces Balanced Point Adjustment (BA), an evaluation protocol with axioms ensuring robustness, threshold-agnostic behavior, and proper ordering. Through analytic derivations and a large-scale simulator study, BA is shown to provide fairer comparisons than existing metrics like PA and KPA. The results offer a principled basis for unbiased detector ranking and suggest directions for extending TSAD evaluation to more complex settings.

Abstract

Time series anomaly detection (TSAD) is an evolving area of research motivated by its critical applications, such as detecting seismic activity, identifying sensor failures in industrial plants, and predicting stock market crashes. Across domains, anomalies occur significantly less frequently than normal data, making the $F_1$-score the most commonly adopted metric for anomaly detection. However, in the case of time series, it is not straightforward to use the standard $F_1$-score because of the dissociation between "time points" and "time events". To accommodate this, anomaly predictions are adjusted before the $F_1$-score is computed, a step called point adjustment (PA). However, these adjustments are heuristic and biased towards true positive detection, resulting in over-estimated detector performance. In this work, we propose an alternative adjustment protocol called "Balanced point adjustment" (BA). It addresses the limitations of existing point adjustment methods and provides guarantees of fairness backed by axiomatic definitions of TSAD evaluation.
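To make the point adjustment step concrete, the following is a minimal sketch of PA as it is commonly described in the TSAD literature: if at least one point inside a ground-truth anomaly segment is flagged, every point of that segment is treated as detected before the $F_1$-score is computed. The function names (`anomaly_segments`, `point_adjust`, `f1_score`) and the toy labels are illustrative assumptions, not code from the paper, and the sketch does not implement the proposed BA protocol.

```python
def anomaly_segments(labels):
    """Return (start, end) pairs, end exclusive, for each run of 1s in labels."""
    segments, start = [], None
    for i, y in enumerate(labels):
        if y == 1 and start is None:
            start = i
        elif y != 1 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments


def point_adjust(labels, preds):
    """Assumed PA rule: one hit inside a true segment marks the whole segment."""
    adjusted = list(preds)
    for s, e in anomaly_segments(labels):
        if any(preds[s:e]):
            adjusted[s:e] = [1] * (e - s)
    return adjusted


def f1_score(labels, preds):
    """Point-wise F1 computed from true positives, false positives, false negatives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (hypothetical data): one anomaly event of width 4,
# one hit inside the event and one false positive outside it.
labels = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
preds = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

f1_pointwise = f1_score(labels, preds)
f1_pa = f1_score(labels, point_adjust(labels, preds))
print(f1_pointwise, f1_pa)
```

In this toy case a single hit inside the event lifts the score from $1/3$ to $8/9$, while the false positive outside the event is barely penalized, which is the kind of over-estimation the abstract refers to.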

Paper Structure (28 sections, 6 theorems, 20 equations, 6 figures)

Introduction
Related Works
Methods
Notations
Time-series
Anomaly labels
Anomaly segment
Anomaly detector
Metric for time-series anomaly detection
Illustrative analysis
Axiomatic criterion for TSAD metrics
Metric analysis
Pre-requisites
C-1 (robust)
C-1a (threshold agnostic)
...and 13 more sections

Key Result

Theorem 1

The point-adjusted (PA) F1 score ($F_{1PA}$) of any random time-series anomaly detector working on a sufficiently large time series of length $T$ having a single anomaly event ($S_A := S_a$) is: where $q = \frac{|S_a|}{T}$ is the anomaly ratio, $N(\cdot)$ is the noise cdf.

Figures (6)

Figure 1: A comparative view of different point adjustment methods for a given ground truth and predicted labels. The $F_{1KPA}$ score is computed with $K=40\%$. $F_{1BA}$ is the method proposed in this paper, and it is the only metric that penalizes false positive detection. Orange highlights mark detections that are left as they are, and green highlights mark instances that are adjusted before the $F_{1}$ score computation.
Figure 2: Comparison of $F_{1p}$, $F_{1PA}$, $F_{1KPA}$, and our proposed $F_{1BA}$. In the table, the green color shows ideal metric values for a perfect detection, while the red color highlights failure to indicate correct predictions. The proposed $F_{1BA}$ consistently makes meaningful transitions, unlike other metrics.
Figure 3: (a) The behavior of the BA metrics $P_{BA}, R_{BA}, F_{1BA}$ compared to the PA metrics for scores drawn from uniform noise with varying threshold $\gamma$, using an anomaly width of $100$ and ratio $q=0.2$. $F_{1PA}$ rises above $0.75$ for random anomaly scores. (b) The behavior of $F_{1PA}$ and $F_{1BA}$ with varying $\gamma$ for different anomaly ratios ($q$), with an anomaly width of $100$. $F_{1PA}$ increases with higher thresholds, while $F_{1BA}$ remains unaffected by threshold choice.
Figure 4: Metric behavior plotted against the score separation metric (\ref{def-sep}). Plots are made for recall (\ref{def-recall}) in $3$ different bins: $<25\%$, $25\%$ to $75\%$, and $>75\%$. The bins are chosen so that each holds a similar number of data points. Precision is held within $25\%$ to $75\%$.
Figure 5: Metric behavior plotted against the precision (\ref{def-prec}). Plots are made for coverage score (\ref{def-cov}) in $3$ different bins: $<20\%$, $20\%$ to $30\%$, and $>30\%$. The bins are chosen so that each holds a similar number of data points. Recall is held within $25\%$ to $75\%$.
...and 1 more figure

Theorems & Definitions (14)

Definition 1: Point adjustment
Definition 2: Balanced Adjustment (BA)
Theorem 1: $\mathbf{F_{1PA}}$ in random noise
proof
Theorem 2: $\mathbf{F_{1BA}}$ in random noise
proof
Lemma 2.1
proof
Lemma 2.2
proof
...and 4 more
