A Safe Reinforcement Learning Algorithm for Supervisory Control of Power Plants

Yixuan Sun, Sami Khairy, Richard B. Vilim, Rui Hu, Akshay J. Dave

TL;DR

This work proposes a chance-constrained RL algorithm based on Proximal Policy Optimization (PPO) for supervisory control. In a load-follow maneuver for an advanced nuclear power plant (NPP) design, it achieves the smallest violation distance and violation rate.

Abstract

Traditional control-theoretic methods require tailored engineering for each system and constant fine-tuning. In power plant control, one often needs to obtain a precise representation of the system dynamics and carefully design the control scheme accordingly. Model-free reinforcement learning (RL) has emerged as a promising solution for control tasks because it learns from trial-and-error interactions with the environment. It eliminates the need to explicitly model the environment's dynamics, a step that can introduce inaccuracies. However, directly imposing state constraints in power plant control is challenging for standard RL methods. To address this, we propose a chance-constrained RL algorithm based on Proximal Policy Optimization (PPO) for supervisory control. Our method employs Lagrangian relaxation to convert the constrained optimization problem into an unconstrained objective, in which trainable Lagrange multipliers enforce the state constraints. Our approach achieves the smallest violation distance and violation rate in a load-follow maneuver for an advanced nuclear power plant (NPP) design.
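To make the Lagrangian relaxation described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes each chance constraint is handled through a per-step cost advantage, that the multiplier-weighted cost advantages are subtracted from the reward advantage before the standard PPO clipping is applied, and that each multiplier is updated by projected gradient ascent on the gap between an empirical violation rate and a tolerance. The function names, `clip_eps`, `lr`, and `tolerance` are hypothetical and chosen for illustration only.

```python
def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate over a batch (a quantity to maximize)."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


def lagrangian_objective(ratios, reward_adv, cost_advs, lambdas, clip_eps=0.2):
    """Unconstrained objective: reward advantage minus weighted cost advantages.

    cost_advs holds one advantage sequence per constraint, and lambdas holds
    the matching non-negative multipliers (assumed layout, for illustration).
    """
    combined = []
    for t in range(len(ratios)):
        penalty = sum(lam * c_adv[t] for lam, c_adv in zip(lambdas, cost_advs))
        combined.append(reward_adv[t] - penalty)
    return clipped_surrogate(ratios, combined, clip_eps)


def update_multiplier(lam, violation_rate, tolerance, lr=0.05):
    """Projected ascent step: grow lam when violations exceed the tolerance."""
    return max(0.0, lam + lr * (violation_rate - tolerance))


# Example usage with made-up numbers for two constraints and three steps.
ratios = [0.9, 1.1, 1.3]
reward_adv = [0.5, -0.2, 1.0]
cost_advs = [[0.1, 0.0, 0.4], [0.0, 0.3, 0.0]]
lambdas = [1.0, 0.5]
objective = lagrangian_objective(ratios, reward_adv, cost_advs, lambdas)
lambdas = [update_multiplier(lam, rate, tolerance=0.05)
           for lam, rate in zip(lambdas, [0.20, 0.01])]
```

In this sketch a multiplier rises while its constraint is violated more often than the tolerance allows and decays toward zero otherwise, so the penalty weight tracks how hard each constraint is to satisfy.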

Paper Structure

This paper contains 21 sections, 24 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Application of RL for supervisory NPP Control using the proposed SAM-RL environment
  • Figure 2: Layout of the Advanced NPP studied in this work. Measurement locations of various states are annotated. The major actuators are listed in blue text below the component labels.
  • Figure 3: Training curves of the proposed models with MLP and LSTM actors. From left to right, the panels show the rewards, the costs associated with the two safety constraints, the magnitudes of the learned Lagrange multipliers, and the policy entropy.
  • Figure 4: Visualization of the agents' performance on a testing trajectory. a. & c. show the performance of the MLP actor; b. & d. show the performance of the LSTM actor. Both models comply with the safety constraints while closely following demand when possible. The MLP actor adheres more closely to the demand and safety constraints, whereas the LSTM actor produces fewer oscillations in actions. In c. & d., both trained RL agents adapt successfully to time-varying constraints. The resulting states adhere closely to the changing constraints and return to following the demand-induced trajectory when potential violations are no longer present.
  • Figure 5: Response of other state variables to the load demand under the $\lambda$-PPO agent. Compared to the reference governor, the $\lambda$-PPO agent adheres to the constraints more closely without violation.
  • ...and 4 more figures
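
The captions above describe constraint adherence under time-varying bounds, and the TL;DR reports a violation rate and a violation distance. As an illustration only, the sketch below shows one way such metrics might be computed for a single state trajectory. The exact definitions used in the paper are not given in this summary, so the definitions here (fraction of time steps outside the bounds, and mean distance beyond the nearest bound over the violating steps) are assumptions, and `violation_metrics` is a hypothetical helper name.

```python
def violation_metrics(states, lower, upper):
    """Return (violation_rate, mean_violation_distance) for one trajectory.

    states, lower, and upper are equal-length sequences, so the bounds may
    change at every time step (assumed layout, for illustration).
    """
    distances = []
    for x, lo, hi in zip(states, lower, upper):
        # Distance outside the feasible interval [lo, hi]; zero when inside.
        distances.append(max(lo - x, x - hi, 0.0))
    violated = [d for d in distances if d > 0.0]
    rate = len(violated) / len(distances)
    mean_distance = sum(violated) / len(violated) if violated else 0.0
    return rate, mean_distance


# Example usage with made-up values and an upper bound that changes mid-run.
rate, mean_distance = violation_metrics(
    states=[1.0, 2.5, 3.0, 0.5],
    lower=[1.0, 1.0, 1.0, 1.0],
    upper=[2.0, 2.0, 2.5, 2.0],
)
```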