Are Anomaly Scores Telling the Whole Story? A Benchmark for Multilevel Anomaly Detection

Tri Cao, Minh-Huy Trinh, Ailin Deng, Quoc-Nam Nguyen, Khoa Duong, Ngai-Man Cheung, Bryan Hooi

TL;DR

The paper proposes Multilevel AD (MAD), a novel setting in which the anomaly score represents the severity of anomalies in real-world applications, and introduces MAD-Bench, a novel benchmark that evaluates models not only on their ability to detect anomalies but also on how effectively their anomaly scores reflect severity.

Abstract

Anomaly detection (AD) is a machine learning task that identifies anomalies by learning patterns from normal training data. In many real-world scenarios, anomalies vary in severity, from minor anomalies with little risk to severe abnormalities requiring immediate attention. However, existing models primarily operate in a binary setting, and the anomaly scores they produce are usually based on the deviation of data points from normal data, which may not accurately reflect practical severity. In this paper, we address this gap by making three key contributions. First, we propose a novel setting, Multilevel AD (MAD), in which the anomaly score represents the severity of anomalies in real-world applications, and we highlight its diverse applications across various domains. Second, we introduce a novel benchmark, MAD-Bench, that evaluates models not only on their ability to detect anomalies, but also on how effectively their anomaly scores reflect severity. This benchmark incorporates multiple types of baselines and real-world applications involving severity. Finally, we conduct a comprehensive performance analysis on MAD-Bench. We evaluate models on their ability to assign severity-aligned scores, investigate the correspondence between their performance on binary and multilevel detection, and study their robustness. This analysis offers key insights into improving AD models for practical severity alignment. The code framework and datasets used for the benchmark will be made publicly available.
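
To make the idea of severity-aligned scores concrete, the following is a minimal sketch of one way such alignment could be measured. It is not taken from the paper: the pairwise concordance measure, the function name `severity_concordance`, and the toy severity levels and scores are all assumptions chosen for illustration, and the benchmark's actual metrics may differ.

```python
from itertools import combinations


def severity_concordance(levels, scores):
    """Fraction of sample pairs with different severity levels whose
    anomaly scores are ordered the same way as their levels.

    Hypothetical illustration only: pairs with tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (l1, s1), (l2, s2) in combinations(zip(levels, scores), 2):
        if l1 == l2:
            continue  # same severity level: the pair says nothing about ordering
        comparable += 1
        if s1 == s2:
            concordant += 0.5  # tied scores get half credit
        elif (l1 < l2) == (s1 < s2):
            concordant += 1.0  # score order matches severity order
    if comparable == 0:
        raise ValueError("need at least two distinct severity levels")
    return concordant / comparable


# Toy data (assumed): level 0 is normal, levels 1-3 are increasingly severe.
levels = [0, 1, 2, 3]
aligned = [0.1, 0.4, 0.6, 0.9]      # scores grow with severity
binary_like = [0.1, 0.9, 0.9, 0.9]  # separates normal from anomalous only

print(severity_concordance(levels, aligned))
print(severity_concordance(levels, binary_like))
```

In this toy case the second detector separates the normal sample from every anomaly perfectly, so it would look ideal in a binary evaluation, yet its concordance is lower because it cannot rank the anomalies among themselves. That gap is the kind of behavior a multilevel evaluation is meant to expose.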

Paper Structure

This paper contains 26 sections, 4 equations, 2 figures, 19 tables.

Figures (2)

  • Figure 1: (a) Binary Anomaly Detection classifies data as either in-distribution (ID) or out-of-distribution (OOD), without accounting for severity. (b) The Multilevel Anomaly Detection setting categorizes OOD data by severity, reflecting the potential impact or risk. For instance, in COVID-19 chest X-rays, severity increases with greater lung involvement, from mild ground-glass opacities to extensive consolidation, indicating heightened clinical urgency.
  • Figure 2: MLLM-based baselines enable few-shot multilevel AD by leveraging their domain knowledge without fine-tuning, unlike conventional baselines requiring target data training.