Autoencoder-based Anomaly Detection System for Online Data Quality Monitoring of the CMS Electromagnetic Calorimeter
The CMS ECAL Collaboration
TL;DR
This work addresses real-time data quality monitoring (DQM) for the CMS electromagnetic calorimeter (ECAL) with a semi-supervised, autoencoder-based anomaly detection system that operates on occupancy images. Spatial-response and time-evolution corrections improve the detection efficiency while keeping false alarms low, so that nearly all anomalies are captured at a low false discovery rate. The autoencoder is trained on abundant good data, fake anomalies are used to calibrate the detection threshold, and the system is validated on real anomalies from 2018 and 2022, which it localizes at the tower level. The system has been deployed in Run 3 within the CMS software framework (CMSSW) using ONNX Runtime, where it produces real-time ML quality plots that complement the traditional DQM and can flag degrading channels. The approach could also be adapted to other detector subsystems or experiments.
Abstract
The CMS detector is a general-purpose apparatus that detects high-energy collisions produced at the LHC. Online Data Quality Monitoring of the CMS electromagnetic calorimeter is a vital operational tool that allows detector experts to quickly identify, localize, and diagnose a broad range of detector issues that could affect the quality of physics data. A real-time autoencoder-based anomaly detection system using semi-supervised machine learning is presented, enabling the detection of anomalies in the CMS electromagnetic calorimeter data. A novel method is introduced that maximizes the anomaly detection performance by exploiting the time-dependent evolution of anomalies as well as spatial variations in the detector response. The autoencoder-based system detects anomalies efficiently while maintaining a very low false discovery rate. The performance of the system is validated with anomalies found in 2018 and 2022 LHC collision data. Additionally, the first results from deploying the autoencoder-based system in the CMS online Data Quality Monitoring workflow at the beginning of Run 3 of the LHC are presented, showing its ability to detect issues missed by the existing system.
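The post-processing outlined above can be illustrated with a minimal sketch. The abstract does not give the exact form of the corrections, so the code makes explicit assumptions: the spatial correction is taken to be a per-tower normalization of the reconstruction loss by its average on good data, the time correction is taken to be a tower-by-tower product of consecutive loss maps (a persistent anomaly survives the product, a random fluctuation is suppressed), and the threshold is chosen as a quantile of the fake-anomaly scores. All function names and parameters are hypothetical and are not taken from CMSSW or from the deployed system.

```python
import numpy as np


def spatial_correction(loss_maps, mean_good_loss, eps=1e-12):
    """Normalize per-tower losses by the average loss of each tower on good data.

    Assumption: towers differ in their baseline response, so dividing by the
    good-data average puts all towers on a comparable scale.
    """
    return np.asarray(loss_maps, dtype=float) / (np.asarray(mean_good_loss) + eps)


def time_correction(loss_maps, n_consecutive=3):
    """Multiply the loss maps of n consecutive time steps, tower by tower.

    Assumption: loss_maps has shape (n_steps, n_eta, n_phi), ordered in time.
    """
    loss_maps = np.asarray(loss_maps, dtype=float)
    n_out = loss_maps.shape[0] - n_consecutive + 1
    if n_out < 1:
        raise ValueError("need at least n_consecutive time steps")
    product = np.ones((n_out,) + loss_maps.shape[1:])
    for k in range(n_consecutive):
        product *= loss_maps[k:k + n_out]
    return product


def calibrate_threshold(fake_anomaly_scores, capture_fraction=0.99):
    """Pick the threshold that keeps the requested fraction of fake anomalies.

    Scores greater than or equal to the returned value are flagged.
    """
    return float(np.quantile(fake_anomaly_scores, 1.0 - capture_fraction))


def false_discovery_rate(scores, is_anomalous, threshold):
    """Fraction of flagged towers that are actually good (false alarms)."""
    scores = np.asarray(scores, dtype=float)
    is_anomalous = np.asarray(is_anomalous, dtype=bool)
    flagged = scores >= threshold
    n_flagged = int(flagged.sum())
    if n_flagged == 0:
        return 0.0
    return float((flagged & ~is_anomalous).sum() / n_flagged)
```

In this sketch the corrections are applied in sequence (spatial normalization first, then the time product), the threshold is calibrated once on towers with injected fake anomalies, and the false discovery rate is then evaluated on a labelled validation sample.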
