Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data
Mulugeta Weldezgina Asres, Christian Walter Omlin, The CMS-HCAL Collaboration
TL;DR
This work tackles the challenge of learning temporal causal graphs from large-scale binary anomaly flag data. It introduces AnomalyCD, a framework that combines online anomaly detection, anomaly-aware conditional independence (CI) testing, sparse data and link compression, edge pruning, and Bayesian-network inference to build graphical causal models (GCMs) scalably and accurately for real-time root-cause analysis (RCA). Validation on CMS HCAL readout data and the EasyVista IT-monitoring dataset shows substantial reductions in data volume and computation together with moderately improved causal-graph quality, supporting practical applicability in complex cyber-physical systems. The approach enables efficient, online root-cause analysis in environments with sparse, high-dimensional binary monitoring signals, and the authors provide open-source code for replication and extension.
Abstract
Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a broad set of monitoring variables across multiple subsystems. However, learning graphical causal models (GCMs) carries a significant computational burden that limits the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data (the meaning of state transitions and data sparsity) challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach, AnomalyCD, that addresses the accuracy and computational challenges of generating GCMs from temporal binary flag datasets. AnomalyCD introduces several strategies, including anomaly data-aware causality testing, sparse data and prior link compression, and edge pruning adjustment. We validate the performance of the approach on two datasets: monitoring sensor data from the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public dataset for information technology monitoring. The results on temporal GCMs demonstrate a considerable reduction in computational overhead and a moderate improvement in accuracy on the binary anomaly datasets. Source code: https://github.com/muleina/AnomalyCD
