Beyond PID Controllers: PPO with Neuralized PID Policy for Proton Beam Intensity Control in Mu2e

Chenwei Xu, Jerry Yao-Chieh Hu, Aakaash Narayanan, Mattson Thieme, Vladimir Nagaslaev, Mark Austin, Jeremy Arnold, Jose Berlioz, Pierrick Hanlet, Aisha Ibrahim, Dennis Nicklaus, Jovan Mitrevski, Jason Michael St. John, Gauri Pradhan, Andrea Saewert, Kiyomi Seiya, Brian Schupbach, Randy Thurman-Keup, Nhan Tran, Rui Shi, Seda Ogrenci, Alexis Maya-Isabelle Shuping, Kyle Hazelwood, Han Liu

TL;DR

This work targets uniform proton spill intensity for Mu2e by modeling the accelerator as a Markov Decision Process and applying Proximal Policy Optimization (PPO) to a policy that combines a neuralized PID bias with a learned correction. The approach uses an EMA-based reward and a differentiable Mu2e simulator to train a controller whose action at time $t$ is the sum of a PID term and a learned correction, $a_t = \pi(s_t) = \pi^{\text{PID}}(o_t) + \pi^{\text{action}}(a_{t-1})$, with the aim of real-time spill regulation. In simulation, the controller improves the Spill Duty Factor (SDF) by 13.6% over unregulated spills and by 1.6% over a PID baseline across multiple random seeds, indicating that combining a model-based inductive bias with RL can improve accelerator control. The results are a step toward automated, real-time proton beam intensity control for Mu2e, with practical implications for stable beam delivery and reduced background in precision physics experiments.

Abstract

We introduce a novel Proximal Policy Optimization (PPO) algorithm for maintaining uniform proton beam intensity delivery in the Muon to Electron Conversion Experiment (Mu2e) at Fermi National Accelerator Laboratory (Fermilab). Our primary objective is to regulate the spill process to ensure a consistent intensity profile, with the ultimate goal of creating an automated controller capable of providing real-time feedback and calibration of the Spill Regulation System (SRS) parameters on a millisecond timescale. We treat the Mu2e accelerator system as a Markov Decision Process suitable for Reinforcement Learning (RL), using PPO to reduce bias and enhance training stability. A key innovation in our approach is the integration of a neuralized Proportional-Integral-Derivative (PID) controller into the policy function, which improves the Spill Duty Factor (SDF) by 13.6% and surpasses the current PID controller baseline by an additional 1.6%. This paper presents preliminary offline results based on a differentiable simulator of the Mu2e accelerator. It lays the groundwork for real-time implementations and applications, representing a crucial step towards automated proton beam intensity control for the Mu2e experiment.
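
The additive structure of the policy, a PID term plus a learned correction, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `NeuralizedPIDPolicy`, the gain values, the network size, the setpoint of 1, and the sign convention of the action are all assumptions made for illustration. The PID gains are stored as ordinary weights so that a gradient-based learner such as PPO could in principle update them together with the correction network.

```python
import math
import random


class NeuralizedPIDPolicy:
    """Hypothetical sketch of a_t = pi_PID(o_t) + pi_action(a_{t-1})."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05, hidden=4, seed=0):
        rng = random.Random(seed)
        # PID gains treated as the weights of one linear unit (assumed values).
        self.gains = [kp, ki, kd]
        # Tiny one-hidden-layer network for the learned correction.
        self.w1 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        # Zero output weights: before any training the policy is a plain PID.
        self.w2 = [0.0] * hidden
        self.b2 = 0.0
        self.integral = 0.0
        self.prev_error = 0.0

    def pid_term(self, observation, target=1.0):
        # Error of the observed spill intensity relative to the assumed setpoint.
        error = target - observation
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        features = [error, self.integral, derivative]
        return sum(g * f for g, f in zip(self.gains, features))

    def correction_term(self, prev_action):
        # Learned correction conditioned on the previous action.
        hidden = [math.tanh(w * prev_action + b)
                  for w, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2

    def act(self, observation, prev_action):
        return self.pid_term(observation) + self.correction_term(prev_action)


policy = NeuralizedPIDPolicy()
first_action = policy.act(observation=0.8, prev_action=0.0)
second_action = policy.act(observation=1.0, prev_action=first_action)
```

Initializing the correction's output weights to zero is one possible design choice: training then starts from the PID baseline, and the correction only has to learn what the PID term misses.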

Paper Structure

This paper contains 10 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: (a): Without any regulation, the extraction (or ‘spill’) of protons from the Delivery Ring is noisy: the intensity deviates from 1. (b): A snapshot of the beam in physical space at the extraction location. As the horizontal beam size increases, the slice of circulating beam that is past the position of the electrostatic septum is extracted. (c): To create the muons, proton pulses are made to hit a production target, and the muons are obtained from the secondary particles. Proton pulses with the required time structure are extracted from the Delivery Ring, an accelerator ring at Fermilab, and sent to the Mu2e production target.
  • Figure 2: (a): The Mu2e simulator first generates noisy spill data. The agent adjusts the spill, and the adjusted spill is used to compute the state and reward. These are used to train the RL agent, which in turn proposes new actions for the next time step to refine the spill. (b): The simulator corrects the noisy spill. The state and reward, computed from the corrected spill, update the value network, which evaluates the quality of the correction. The state and the loss from the value network then update the policy network, which generates new actions for spill regulation.
  • Figure 3: Spill intensity and SDF comparison across different seeds. (LHS): Comparison of spill intensity: the RL-corrected spill is closer to 1 than the PID-corrected spill. (RHS): Comparison of SDF: after 600 training iterations, the SDF achieved by RL exceeds or approaches the SDF obtained with PID.
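
To make the SDF comparison in Figure 3 concrete, here is a minimal sketch of how a duty factor could be computed from a sampled spill. The formula is an assumption, since this summary does not state the paper's definition: it uses the common form SDF = 1 / (1 + (σ/μ)²), where μ and σ are the mean and standard deviation of the spill intensity. The function name `spill_duty_factor` is hypothetical. Under this form, a perfectly uniform spill has SDF = 1 and any fluctuation lowers it.

```python
from statistics import fmean, pstdev


def spill_duty_factor(intensities):
    """Assumed definition: 1 / (1 + (sigma / mu)^2) over the spill samples."""
    mean = fmean(intensities)
    if mean == 0:
        raise ValueError("mean spill intensity must be non-zero")
    # Relative fluctuation of the spill around its mean.
    relative_spread = pstdev(intensities) / mean
    return 1.0 / (1.0 + relative_spread ** 2)


uniform_sdf = spill_duty_factor([1.0, 1.0, 1.0, 1.0])  # perfectly flat spill
noisy_sdf = spill_duty_factor([0.5, 1.5])              # mean 1, spread 0.5
```

For a spill normalized to a mean of 1, as in Figures 1 and 3, the expression reduces to 1 / (1 + σ²).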