Towards Generalizable Context-aware Anomaly Detection: A Large-scale Benchmark in Cloud Environments

Xinkai Zou; Xuan Jiang; Ruikai Huang; Haoze He; Parv Kapoor; Hongrui Wu; Yibo Wang; Jian Sha; Xiongbo Shi; Zixun Huang; Jinhua Zhao

TL;DR

This work introduces CloudAnoBench, a large-scale multimodal benchmark for context-aware anomaly detection in cloud environments. It contains $1{,}252$ labeled cases across $28$ anomalous and $16$ deceptive normal scenarios, totaling roughly $200{,}000$ lines of synchronized metrics and logs (per case, $90$ metric lines spanning $450$ seconds and $40$--$80$ aligned log entries). The work also introduces CloudAnoAgent, an LLM-based multi-agent system guided by a symbolic verifier that fuses metrics and logs for anomaly detection and scenario identification. CloudAnoAgent achieves up to approximately $0.20$ relative improvement in $F1$ over baselines, reduces false positives through the symbolic verifier, and generalizes to log-only datasets. Together, CloudAnoBench and CloudAnoAgent offer a framework for evaluating and advancing context-aware anomaly detection in real-world cloud systems, including deceptive normal scenarios that challenge detectors. This lays the groundwork for reliable multimodal anomaly reasoning in cloud reliability and operations.
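As a quick sanity check on the stated scale, the per-case figures quoted above can be multiplied out. The sketch below is only arithmetic on the numbers in the summary. The variable names are illustrative, and the sampling interval assumes the $90$ metric lines are evenly spaced over the $450$-second window, which the summary does not state explicitly.

```python
# Figures taken from the summary; names are illustrative, not from the benchmark code.
CASES = 1252
METRIC_LINES_PER_CASE = 90
WINDOW_SECONDS = 450
LOG_ENTRIES_MIN, LOG_ENTRIES_MAX = 40, 80

# Assumption: metric lines are evenly spaced across the window.
sampling_interval_s = WINDOW_SECONDS / METRIC_LINES_PER_CASE

# Lower and upper bounds on total lines (metrics + logs) across all cases.
total_lines_low = CASES * (METRIC_LINES_PER_CASE + LOG_ENTRIES_MIN)
total_lines_high = CASES * (METRIC_LINES_PER_CASE + LOG_ENTRIES_MAX)

print(f"sampling interval: {sampling_interval_s} s")
print(f"total lines: {total_lines_low:,} to {total_lines_high:,}")
```

The resulting range brackets the "roughly $200{,}000$" figure, with the upper bound reached only if every case carried the maximum of $80$ log entries.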

Abstract

Anomaly detection in cloud environments remains both critical and challenging. Existing context-level benchmarks typically focus on either metrics or logs and often lack reliable annotation, while most detection methods emphasize point anomalies within a single modality, overlooking contextual signals and limiting real-world applicability. Constructing a benchmark for context anomalies that combines metrics and logs is inherently difficult: reproducing anomalous scenarios on real servers is often infeasible or potentially harmful, while generating synthetic data introduces the additional challenge of maintaining cross-modal consistency. We introduce CloudAnoBench, a large-scale benchmark for context anomalies in cloud environments, comprising 28 anomalous scenarios and 16 deceptive normal scenarios, with 1,252 labeled cases and roughly 200,000 log and metric entries. Compared with prior benchmarks, CloudAnoBench exhibits higher ambiguity and greater difficulty, and both prior machine learning methods and vanilla LLM prompting perform poorly on it. To demonstrate its utility, we further propose CloudAnoAgent, an LLM-based agent enhanced by symbolic verification that integrates metrics and logs. This agent system achieves substantial improvements in both anomaly detection and scenario identification on CloudAnoBench, and shows strong generalization to existing datasets. Together, CloudAnoBench and CloudAnoAgent lay the groundwork for advancing context-aware anomaly detection in cloud systems. Project Page: https://jayzou3773.github.io/cloudanobench-agent/
