Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data

Mulugeta Weldezgina Asres, Christian Walter Omlin, The CMS-HCAL Collaboration

TL;DR

This work tackles the challenge of learning temporal causal graphs from large-scale binary anomaly flag data. It introduces AnomalyCD, a framework that combines online anomaly detection, anomaly-aware conditional-independence (CI) testing, sparse data and link compression, edge pruning, and Bayesian-network inference to produce scalable, accurate graphical causal models (GCMs) for real-time root-cause analysis (RCA). Validation on CMS HCAL readout data and the EasyVista IT-monitoring dataset shows substantial reductions in data volume and computation together with improved causal-graph quality, supporting practical use in complex cyber-physical systems. The approach enables efficient online RCA in environments with sparse, high-dimensional binary monitoring signals, and the authors provide open-source code for replication and extension.

Abstract

Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a broader set of monitoring variables across multiple subsystems. However, learning graphical causal models (GCMs) comes with a significant computational burden that restricts the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data (the meaning of state transitions and data sparsity) challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (AnomalyCD) that addresses the accuracy and computational challenges of generating GCMs from temporal binary flag datasets. AnomalyCD presents several strategies, such as anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment approaches. We validate the performance of the approach on two datasets: monitoring sensor data of the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public dataset for information technology monitoring. The results on temporal GCMs demonstrate a considerable reduction of computation overhead and a moderate enhancement of accuracy on the binary anomaly datasets. Source code: https://github.com/muleina/AnomalyCD
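
To make the sparsity issue concrete, the following is a minimal sketch, not the AnomalyCD implementation. It assumes one simple way to compress sparse binary flag data: every run of all-zero time steps is shortened to at most `max_lag` rows, so flags that were more than `max_lag` steps apart remain more than `max_lag` apart. The function names `compress_zero_runs` and `lagged_cooccurrence`, the counting-based estimate, and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def compress_zero_runs(flags: Sequence[Sequence[int]], max_lag: int) -> List[List[int]]:
    """Shorten every run of all-zero time steps to at most `max_lag` rows.

    Assumption for illustration: rows are time steps, columns are binary
    anomaly flags, and only lags up to `max_lag` matter downstream.
    """
    compressed: List[List[int]] = []
    zero_run = 0
    for row in flags:
        if any(row):
            zero_run = 0
            compressed.append(list(row))
        else:
            zero_run += 1
            if zero_run <= max_lag:
                compressed.append(list(row))
    return compressed


def lagged_cooccurrence(flags: Sequence[Sequence[int]], cause: int, effect: int, lag: int) -> float:
    """Estimate P(effect_t = 1 | cause_{t-lag} = 1) by simple counting."""
    hits = 0
    total = 0
    for t in range(lag, len(flags)):
        if flags[t - lag][cause] == 1:
            total += 1
            hits += flags[t][effect]
    return hits / total if total else 0.0


# Toy data: variable 0 raises a flag, variable 1 follows one step later.
toy = [[0, 0] for _ in range(100)]
for t in (5, 50):
    toy[t][0] = 1
    toy[t + 1][1] = 1

small = compress_zero_runs(toy, max_lag=1)
print(len(toy), "rows before,", len(small), "rows after")
print(lagged_cooccurrence(toy, cause=0, effect=1, lag=1))
print(lagged_cooccurrence(small, cause=0, effect=1, lag=1))
```

In this toy case the lag-1 estimate is the same before and after compression, while the number of rows a causality test would have to scan drops sharply. Real monitoring data would need a proper conditional-independence test rather than raw counting.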

Paper Structure

This paper contains 31 sections, 18 equations, 17 figures, 8 tables, 1 algorithm.

Figures (17)

  • Figure 1: A TS with time lag effect $\mathbf{x}^1_{t - 1} \rightarrow \mathbf{x}^2$ and instantaneous effect $\mathbf{x}^1_t \rightarrow \mathbf{x}^3_t$ [peters2017elements].
  • Figure 2: Schematic of the CMS experiment [focardi2012status]
  • Figure 3: The frontend electronics of the HE data acquisition chain, including the SiPMs, the frontend readout cards, and the optical link connected to the back-end electronics [strobbe2017upgrade]. Each readout card contains twelve QIE11 for charge integration, an Igloo2 FPGA for data serialization and encoding, and a VTTx optical transmitter
  • Figure 4: The active mask of the LHC operation status from August to December 2022. The active mask ($1$) marks normal LHC operation, during collision runs or idle periods, whereas the inactive mask ($0$) marks non-physics operation states, e.g., the LHC's technical stops and maintenance
  • Figure 5: Sensor TS data from all RMs of the RBX-HEP07. HEP07_i denotes the $i^{\text{th}}$ RM of the RBX
  • ...and 12 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2