AI-Enabled Operations at Fermi Complex: Multivariate Time Series Prediction for Outage Prediction and Diagnosis

Milan Jain, Burcu O. Mutlu, Caleb Stam, Jan Strube, Brian A. Schupbach, Jason M. St. John, William A. Pellico

TL;DR

The paper aims to reduce beam downtime at Fermilab by predicting outages from multivariate time-series data and automatically labeling outage causes. It systematically evaluates state-of-the-art multivariate time-series models (LSTM, Transformer, and linear models such as N-BEATS, N-HiTS, TiDE, and TSMixer) for beam-permit prediction, and introduces a Random Forest-based outage labeler built on a pipeline of bit-based and historically aggregated features. The results show that LSTM achieves the highest early-outage detection rate, while the Random Forest labeler reaches 82.1% accuracy on operator-labeled outages and substantially reduces the number of unlabeled outages. The work demonstrates deployment in FNAL control rooms and highlights interpretability, normalization, and data-loading challenges, pointing to continual and transfer learning as future directions for scaling across time.

Abstract

The Main Control Room of the Fermilab accelerator complex continuously gathers extensive time-series data from thousands of sensors monitoring the beam. However, unplanned events such as trips or voltage fluctuations often result in beam outages, causing operational downtime. This downtime not only consumes operator effort in diagnosing and addressing the issue but also leads to unnecessary energy consumption by idle machines awaiting beam restoration. The current threshold-based alarm system is reactive and faces challenges including frequent false alarms and inconsistent outage-cause labeling. To address these limitations, we propose an AI-enabled framework that leverages predictive analytics and automated labeling. Using data from 2,703 Linac devices and 80 operator-labeled outages, we evaluate state-of-the-art deep learning architectures, including recurrent, attention-based, and linear models, for beam outage prediction. Additionally, we assess a Random Forest-based labeling system for providing consistent, confidence-scored outage annotations. Our findings highlight the strengths and weaknesses of these architectures for beam outage prediction and identify critical gaps that must be addressed to fully harness AI for transitioning downtime handling from reactive to predictive, ultimately reducing downtime and improving decision-making in accelerator management.
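
To make the idea of a "confidence-scored" outage annotation concrete, the following is a minimal sketch and not the authors' implementation. It assumes the confidence is the fraction of ensemble trees that vote for the winning class, which is a simplification of how a Random Forest can score its predictions. The function name `label_with_confidence`, the `min_confidence` threshold, and the class names (`rf_trip`, `vacuum`, `other`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def label_with_confidence(tree_votes, min_confidence=0.6):
    """Return (label, confidence) from a list of per-tree class votes.

    Confidence is the share of trees voting for the most common class.
    If it falls below min_confidence (a hypothetical threshold), the
    outage is left as "unlabeled" so an operator can review it.
    """
    if not tree_votes:
        raise ValueError("tree_votes must not be empty")
    counts = Counter(tree_votes)
    label, n_votes = counts.most_common(1)[0]
    confidence = n_votes / len(tree_votes)
    if confidence < min_confidence:
        return "unlabeled", confidence
    return label, confidence


# Hypothetical votes from a 10-tree ensemble for one outage.
votes = ["rf_trip"] * 7 + ["vacuum"] * 2 + ["other"]
label, confidence = label_with_confidence(votes)
print(label, confidence)
```

Returning the confidence alongside the label lets low-agreement cases be routed to operators instead of being silently assigned a cause, which is one plausible way such a labeler could support consistent annotation.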

Paper Structure (38 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview diagram illustrating: (a) the Fermilab accelerator complex with sample device data collected from the Linac, (b) current operations in the FNAL control room, highlighting key limitations, and (c) the proposed predictive maintenance pipeline along with its potential benefits.
  • Figure 2: Distribution of outage duration by class. The duration is limited to 60 minutes by the size of a single data file.
  • Figure 3: Model-wise detection rate of outage types.
  • Figure 4: Confusion matrix for the Random Forest classifier.
  • Figure 5: Comparison of the RF labeler with the bit labeler.
  • ...and 1 more figure