Reasoning-based Anomaly Detection Framework: A Real-time, Scalable, and Automated Approach to Anomaly Detection Across Domains

Anupam Panwar, Himadri Pal, Jiali Chen, Kyle Cho, Riddick Jiang, Miao Zhao, Rajiv Krishnamurthy

TL;DR

RADF presents a Reasoning-based Anomaly Detection Framework that unifies automated model selection, real-time root-cause analysis, and a configurable orchestration layer to tackle scalable anomaly detection across heterogeneous time-series. The mSelect component automates algorithm and hyperparameter choices by classifying time-series patterns into Stable, Unstable, or Trend and selecting from an ensemble of detectors, while the RCA module performs cross-dimension and cross-metric causal/correlation analyses to explain anomalies. Empirical results on the TSB-UAD benchmark and internal datasets show RADF achieving strong AUC and robust VUS performance, with high precision/recall on several time-series classes, and practical deployment success in production for batch and streaming contexts. The framework emphasizes scalability, interpretability, and minimal manual intervention, offering a practical, enterprise-grade solution for real-world anomaly detection across domains.

Abstract

Detecting anomalies in large, distributed systems presents several challenges. The first challenge arises from the sheer volume of data that needs to be processed. Flagging anomalies in a high-throughput environment calls for careful consideration of both algorithm and system design. The second challenge comes from the heterogeneity of the time-series datasets that rely on such a system in production. In practice, anomaly detection systems are rarely deployed for a single use case. Typically, there are several metrics to monitor, often across several domains (e.g., engineering, business, and operations). A one-size-fits-all approach rarely works, so these systems need to be fine-tuned for every application, which is often done manually. The third challenge is that determining the root cause of anomalies in such settings is akin to finding a needle in a haystack. Identifying, in real time, a time-series dataset that is causally associated with the anomalous time-series data is a very difficult problem. In this paper, we describe a unified framework that addresses these challenges. The Reasoning-based Anomaly Detection Framework (RADF) is designed to perform real-time anomaly detection on very large datasets. The framework employs a novel technique (mSelect) that automates algorithm selection and hyper-parameter tuning for each use case. Finally, it incorporates a post-detection capability that allows for faster triaging and root-cause determination. Our extensive experiments demonstrate that RADF, powered by mSelect, surpasses state-of-the-art anomaly detection models in AUC performance on 5 out of 9 public benchmarking datasets. RADF achieved an AUC of over 0.85 on 7 out of 9 datasets, a distinction unmatched by any other state-of-the-art model.
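
To make the idea of pattern-driven model selection concrete, the following is a minimal sketch of how a series might be labelled Stable, Unstable, or Trend and then mapped to a detector. It is not the mSelect procedure itself, which this summary does not specify. The classification rule (R² of a linear fit for Trend, coefficient of variation for Unstable), the thresholds `trend_r2` and `unstable_cv`, and the detector names in `DETECTOR_BY_CLASS` are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def classify_series(values, trend_r2=0.6, unstable_cv=0.3):
    """Label a series 'Trend', 'Unstable', or 'Stable'.

    Assumption: the rule and both thresholds are illustrative only;
    they are not taken from the RADF paper.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 points")

    # Least-squares fit of value against time index, summarised by R^2.
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    syy = sum((y - y_bar) ** 2 for y in values)
    if syy == 0:
        return "Stable"  # a constant series has no variation at all

    r2 = (sxy * sxy) / (sxx * syy)
    if r2 >= trend_r2:
        return "Trend"

    # Coefficient of variation: spread relative to the mean level.
    cv = pstdev(values) / abs(y_bar) if y_bar != 0 else float("inf")
    return "Unstable" if cv >= unstable_cv else "Stable"


# Hypothetical mapping from pattern class to a detector name.
DETECTOR_BY_CLASS = {
    "Stable": "zscore",
    "Unstable": "rolling_median_mad",
    "Trend": "detrended_zscore",
}


def select_detector(values):
    """Pick a detector name based on the series' pattern class."""
    return DETECTOR_BY_CLASS[classify_series(values)]
```

For example, `select_detector(list(range(20)))` takes the Trend branch because a perfectly linear series has an R² of 1. A production system would presumably also tune each detector's hyper-parameters per class, which this sketch leaves out.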

Paper Structure

This paper contains 21 sections, 1 equation, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Anomaly Detection Orchestrator Architecture
  • Figure 2: Workflow for Root Cause Analysis
  • Figure 3: mSelect Stable Time Series Examples
  • Figure 4: mSelect Unstable Time Series Examples
  • Figure 5: mSelect Trend Time Series Examples
  • ...and 2 more figures