Benchmarking Anomaly Detection Across Heterogeneous Cloud Telemetry Datasets

Mohammad Saiful Islam, Andriy Miranskyy

TL;DR

Using a unified training and evaluation pipeline, this study evaluates four deep learning models (GRU, TCN, Transformer, and TSMixer) and shows that anomaly detection performance in cloud systems is governed not only by model architecture but, critically, by calibration stability and feature-space geometry.

Abstract

Anomaly detection is important for keeping cloud systems reliable and stable. Deep learning has improved time-series anomaly detection, but most models are evaluated on one dataset at a time. This raises the question of whether these models can handle different types of telemetry, especially in large-scale and high-dimensional environments. In this study, we evaluate four deep learning models: GRU, TCN, Transformer, and TSMixer. We also include Isolation Forest as a classical baseline. The models are tested on four telemetry datasets: the Numenta Anomaly Benchmark, the Microsoft Cloud Monitoring dataset, the Exathlon dataset, and the IBM Console dataset. These datasets differ in structure, dimensionality, and labelling strategy. They include univariate time series, synthetic multivariate workloads, and real-world production telemetry with over 100,000 features. We use a unified training and evaluation pipeline across all datasets. The evaluation includes NAB-style metrics, which capture early detection behaviour through window-based scoring on datasets where anomalies persist over contiguous time intervals, even when labels are recorded at the point level. The unified setup enables consistent analysis of model behaviour under shared scoring and calibration assumptions. Our results demonstrate that anomaly detection performance in cloud systems is governed not only by model architecture but, critically, by calibration stability and feature-space geometry. By releasing our preprocessing pipelines, benchmark configuration, and evaluation artifacts, we aim to support reproducible and deployment-aware evaluation of anomaly detection systems for cloud environments.
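
The abstract mentions window-based scoring for anomalies that span contiguous intervals even when labels are recorded per point. A minimal sketch of that idea follows, assuming binary point labels and integer detection indices. The function names and the relative-offset measure are illustrative assumptions; this is not the NAB scoring function itself, nor necessarily the authors' implementation.

```python
from typing import List, Optional, Tuple


def labels_to_windows(labels: List[int]) -> List[Tuple[int, int]]:
    """Group consecutive positive point labels into inclusive (start, end) windows."""
    windows = []
    start = None
    for i, flag in enumerate(labels):
        if flag and start is None:
            start = i                      # a new anomaly window opens here
        elif not flag and start is not None:
            windows.append((start, i - 1))  # the window closed at the previous point
            start = None
    if start is not None:                  # window still open at the end of the series
        windows.append((start, len(labels) - 1))
    return windows


def first_detection_offsets(
    windows: List[Tuple[int, int]], detections: List[int]
) -> List[Optional[float]]:
    """For each window, return the relative position of its earliest detection.

    0.0 means the detection fired at the window start, 1.0 at the window end,
    and None means the window was missed entirely.
    """
    offsets: List[Optional[float]] = []
    for start, end in windows:
        hits = [d for d in detections if start <= d <= end]
        if not hits:
            offsets.append(None)
        else:
            width = max(end - start, 1)    # avoid division by zero for one-point windows
            offsets.append((min(hits) - start) / width)
    return offsets
```

For example, the labels [0, 1, 1, 1, 0, 0, 1, 0] yield the windows (1, 3) and (6, 6). A detection at index 2 falls halfway through the first window, while a detection at index 5 falls outside both windows and would count as a false positive under a window-based scheme.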

Paper Structure (56 sections, 5 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Illustration of likelihood-based anomaly detection. Reconstruction error is converted into a likelihood score using rolling windows and compared against a calibrated threshold. Ground-truth anomaly windows are used only for evaluation.
  • Figure 2: Ranking board over all 29 evaluated subsets using test-set NAB score. For each subset, models are ranked from best (Rank 1) to worst (Rank 5) based on their test-set NAB score using dense ranking. Subsets with no ground-truth anomalies, where all models correctly produce zero score, are counted as complete ties. Subsets with ground-truth anomalies in which no model produces any detection are excluded from the Rank 1 to Rank 5 counts and reported separately as a No-detection category (grey).
  • Figure 3: Calibration space for Microsoft telemetry. The horizontal and vertical axes correspond to long and short likelihood windows. Colors indicate models. Bubble size is proportional to the normalized NAB test score, highlighting configurations that achieve stable positive test performance under conservative likelihood calibration.
  • Figure 4: Distance to training centroid for app5__5_1_100000_64 using the GRU model. Each point represents the $L^2$ distance of a 2,283-dimensional feature vector to the training centroid. Training samples are shown in blue and test samples in orange. Shaded regions indicate ground-truth anomaly windows, and the dashed vertical line marks the train/test split. The strong overlap between training and test distances indicates stable geometry, enabling effective threshold generalization.
  • Figure 5: Distance to training centroid for app9__9_4_1000000_78 using the GRU model. While training samples remain close to the training centroid, test samples exhibit a systematic increase in distance, even outside anomaly windows. This geometric shift explains the large negative normalized NAB score despite correct anomaly detection.
  • ...and 2 more figures
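
The computations described in the captions for Figures 1, 4, and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the likelihood follows the general NAB-style recipe (compare a short-window mean of the reconstruction error against the long-window mean and standard deviation, then map the z-score through the Gaussian CDF), and the exact formula used in the paper is not shown in this excerpt. The function names, the `min_std` floor, and the window sizes are hypothetical.

```python
import math
from statistics import fmean, pstdev
from typing import List, Sequence


def anomaly_likelihood(
    errors: Sequence[float],
    long_window: int,
    short_window: int,
    min_std: float = 1e-6,  # assumed floor so a constant history does not divide by zero
) -> List[float]:
    """Convert reconstruction errors into likelihood scores in [0, 1]."""
    likelihoods = []
    for t in range(len(errors)):
        long_hist = errors[max(0, t - long_window + 1): t + 1]
        short_hist = errors[max(0, t - short_window + 1): t + 1]
        mu = fmean(long_hist)
        sigma = max(pstdev(long_hist), min_std)
        z = (fmean(short_hist) - mu) / sigma
        # Gaussian CDF of the z-score: 0.5 for typical errors, near 1 for unusually large ones.
        likelihoods.append(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    return likelihoods


def detect(likelihoods: Sequence[float], threshold: float) -> List[int]:
    """Return the indices whose likelihood meets or exceeds the calibrated threshold."""
    return [i for i, s in enumerate(likelihoods) if s >= threshold]


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Per-feature mean of the training vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def l2_distance(x: Sequence[float], c: Sequence[float]) -> float:
    """Euclidean (L2) distance between a feature vector and the centroid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))
```

In this sketch, a flat error series scores 0.5 everywhere, and only a sharp rise relative to the long-window history pushes the likelihood toward 1. Comparing `l2_distance` values for training and test vectors against the training centroid gives the kind of diagnostic shown in Figures 4 and 5: if test distances drift upward outside anomaly windows, a threshold calibrated on training data may no longer transfer.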