Rethinking Metrics and Benchmarks of Video Anomaly Detection

Zihao Liu, Xiaoyu Wu, Wenna Li, Linlin Yang, Shengjin Wang

TL;DR

This paper critiques current evaluation practice in Video Anomaly Detection (VAD), identifying three problems: annotation bias, latency-insensitive metrics, and benchmarks that cannot expose scene overfitting. It introduces probabilistic AUC and AP (Prob-AUC, Prob-AP), which integrate multi-round annotations, and a Latency-aware AP (LaAP), which rewards early detections. To stress-test generalization, it creates diffusion-based hard-normal benchmarks (UCF-HN, MSAD-HN) that pair normal scenes with abnormal contexts, exposing overfitting in ten state-of-the-art methods. Empirically, the proposed metrics shift performance rankings, and the hard-normal benchmarks reveal higher false-alarm rates, underscoring the need for latency-aware, bias-robust evaluation and diffusion-driven benchmarks in VAD development.

Abstract

Video Anomaly Detection (VAD), which aims to detect anomalies that deviate from expectation, has attracted increasing attention in recent years. Existing advancements in VAD primarily focus on model architectures and training strategies, while devoting insufficient attention to evaluation metrics and benchmarks. In this paper, we rethink VAD evaluation methods through comprehensive analyses, revealing three critical limitations in current practices: 1) existing metrics are significantly influenced by single-annotation bias; 2) current metrics fail to reward early detection of anomalies; 3) available benchmarks lack the capability to evaluate scene overfitting of fully/weakly-supervised algorithms. To address these limitations, we propose three novel evaluation methods: first, we establish probabilistic AUC/AP (Prob-AUC/AP) metrics utilizing multi-round annotations to mitigate single-annotation bias; second, we develop a Latency-aware Average Precision (LaAP) metric that rewards early and accurate anomaly detection; and finally, we introduce two hard normal benchmarks (UCF-HN, MSAD-HN) with videos specifically designed to evaluate scene overfitting. We report performance comparisons of ten state-of-the-art VAD approaches using our proposed evaluation methods, providing novel perspectives for future VAD model development. We release our data and code at https://github.com/Kamino666/RethinkingVAD.
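To make the idea of folding multi-round annotations into an AUC-style metric concrete, here is a minimal sketch. It is not the paper's Prob-AUC definition, which this summary does not give. It assumes that each annotation round provides one binary label per frame, that the rounds are averaged into a per-frame anomaly probability, and that every ordered frame pair (i, j) is weighted by the probability that i is abnormal and j is normal. The function names `soft_labels` and `soft_auc` are hypothetical.

```python
def soft_labels(annotations):
    """Average several rounds of binary frame labels into per-frame probabilities.

    `annotations` is a list of rounds; each round is a list of 0/1 labels,
    one per frame. (Assumed input layout, for illustration only.)
    """
    rounds = len(annotations)
    return [sum(frame) / rounds for frame in zip(*annotations)]


def soft_auc(scores, probs):
    """Pairwise AUC with probabilistic labels (illustrative, O(n^2)).

    Each ordered pair (i, j) gets weight p_i * (1 - p_j): the chance that
    frame i is abnormal while frame j is normal. The pair counts as correct
    when score_i > score_j, and as half correct on ties. With hard 0/1
    labels this reduces to the ordinary ROC AUC.
    """
    num = 0.0
    den = 0.0
    for i, (s_i, p_i) in enumerate(zip(scores, probs)):
        for j, (s_j, p_j) in enumerate(zip(scores, probs)):
            if i == j:
                continue  # a frame is never compared with itself
            weight = p_i * (1.0 - p_j)
            if weight == 0.0:
                continue
            den += weight
            if s_i > s_j:
                num += weight
            elif s_i == s_j:
                num += 0.5 * weight
    if den == 0.0:
        raise ValueError("labels contain no abnormal/normal contrast")
    return num / den


if __name__ == "__main__":
    # Two hypothetical annotation rounds that disagree on the second frame.
    rounds = [[0, 0, 1, 1], [0, 1, 1, 1]]
    probs = soft_labels(rounds)
    print(probs, soft_auc([0.1, 0.4, 0.35, 0.8], probs))
```

Under these assumptions, frames on which annotators disagree carry less weight in either direction, so a model is penalized less for its score on an ambiguous boundary frame than it would be under a single hard annotation.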

Paper Structure

This paper contains 21 sections, 12 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Three limitations in current metrics and benchmarks of video anomaly detection.
  • Figure 2: Left: The distribution of standard deviations for start time, duration, and end time (in seconds) across different annotations in three datasets. Right: The pair-wise Cohen's Kappa between annotators.
  • Figure 3: Visualization of discrepancies in UCF-Crime. True indicates agreement between annotations; Others denotes our annotation assigning the sample to another anomaly category; Normal denotes videos classified as normal.
  • Figure 4: Heatmap of predicted anomaly positions of different models in abnormal intervals. Brighter areas indicate higher density. These models are introduced in the baselines section of the paper.
  • Figure 5: Comparison between vanilla ROC/P-R curve and probabilistic ROC/P-R curve.
  • ...and 8 more figures