AI-Enabled Operations at Fermi Complex: Multivariate Time Series Prediction for Outage Prediction and Diagnosis
Milan Jain, Burcu O. Mutlu, Caleb Stam, Jan Strube, Brian A. Schupbach, Jason M. St. John, William A. Pellico
TL;DR
The paper aims to reduce beam downtime at Fermilab by predicting outages from multivariate time-series data and automatically labeling outage causes. It systematically evaluates state-of-the-art multivariate time-series models (LSTM, Transformer, and linear models such as N-BEATS, N-HiTS, TiDE, and TSMixer) for beam-permit prediction, and introduces a Random Forest-based outage labeler built on a feature pipeline that combines bit-based and historically aggregated features. The results show that LSTM achieves the highest early-outage detection among the evaluated models, while the Random Forest labeler reaches $82.1\%$ accuracy on operator-labeled outages and substantially reduces the number of unlabeled outages. The work demonstrates deployment in FNAL control rooms, highlights challenges in interpretability, normalization, and data loading, and points to continual and transfer learning as future directions for scaling the approach over time.
Abstract
The Main Control Room of the Fermilab accelerator complex continuously gathers extensive time-series data from thousands of sensors monitoring the beam. However, unplanned events such as trips or voltage fluctuations often result in beam outages, causing operational downtime. This downtime not only consumes operator effort in diagnosing and addressing the issue but also leads to unnecessary energy consumption by idle machines awaiting beam restoration. The current threshold-based alarm system is reactive and faces challenges including frequent false alarms and inconsistent outage-cause labeling. To address these limitations, we propose an AI-enabled framework that leverages predictive analytics and automated labeling. Using data from $2{,}703$ Linac devices and $80$ operator-labeled outages, we evaluate state-of-the-art deep learning architectures, including recurrent, attention-based, and linear models, for beam outage prediction. Additionally, we assess a Random Forest-based labeling system for providing consistent, confidence-scored outage annotations. Our findings highlight the strengths and weaknesses of these architectures for beam outage prediction and identify critical gaps that must be addressed to fully harness AI for transitioning downtime handling from reactive to predictive, ultimately reducing downtime and improving decision-making in accelerator management.
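
To make the idea of a confidence-scored annotation concrete, the following is a minimal sketch of how an ensemble of decision trees could yield both a label and a confidence value. It is an illustration under stated assumptions, not the labeler described in the paper: the function name `label_with_confidence`, the `min_confidence` threshold, the use of the hard-vote fraction as the confidence score, and the cause names `rf_trip` and `vacuum` are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def label_with_confidence(
    tree_votes: Sequence[str],
    min_confidence: float = 0.5,
) -> Tuple[Optional[str], float]:
    """Aggregate per-tree votes into (label, confidence).

    Assumption: confidence is the fraction of trees that voted for the
    winning label. When it falls below min_confidence, the label is
    withheld (None) so the outage can be left for an operator to label.
    """
    if not tree_votes:
        raise ValueError("tree_votes must not be empty")
    # most_common(1) returns the label with the highest vote count.
    label, count = Counter(tree_votes).most_common(1)[0]
    confidence = count / len(tree_votes)
    if confidence < min_confidence:
        return None, confidence
    return label, confidence


# Hypothetical votes from a 5-tree ensemble for a single outage.
votes = ["rf_trip", "rf_trip", "vacuum", "rf_trip", "rf_trip"]
label, confidence = label_with_confidence(votes)
```

Here four of the five hypothetical trees agree, so the outage receives the majority label with a confidence of 0.8. A lower-agreement case would return `None` as the label, which is one plausible way to keep uncertain annotations out of the labeled set.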
