Time-Varying Constraint-Aware Reinforcement Learning for Energy Storage Control
Jaeik Jeong, Tai-Yeon Ku, Wan-Ki Park
TL;DR
This work tackles energy storage control under time-varying, SoC-driven action constraints, where discrete RL and standard continuous RL struggle to explore without driving the storage to a fully charged or fully discharged state. It introduces a continuous PPO framework with an LSTM and adds a supervising objective, the loss term $L^{PPO}_{supervising}(\theta)$, that keeps the policy mean within the feasible action interval $[\bar{P}_{c,t}, \bar{P}_{d,t}]$. The combined objective is $L^{PPO}(\theta)=L^{PPO}_{actor}(\theta)+C_1 L^{PPO}_{critic}(\theta)+C_2 L^{PPO}_{supervising}(\theta)$. The method is validated on energy arbitrage with a 100 MWh battery and SoC bounds. Case 3 (proposed) achieves the highest profit by actively using the storage while avoiding suboptimal dead states, outperforming a purely unconstrained approach and a penalty-based constraint method. The approach has practical implications for grid stability and renewable integration, and the authors point to offline and multi-agent RL extensions as future work to further stabilize policies.
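The combined objective above can be illustrated with a small numeric sketch. The summary does not give the exact form of $L^{PPO}_{supervising}(\theta)$, so the squared hinge penalty below is an assumption chosen for illustration: it is zero when the policy mean lies inside the feasible interval and grows as the mean moves outside it. The function names, the `lower`/`upper` bounds standing in for $\bar{P}_{c,t}$ and $\bar{P}_{d,t}$, and the coefficient defaults are all hypothetical.

```python
def supervising_loss(mean, lower, upper):
    """Assumed squared hinge penalty on the policy mean.

    Returns 0.0 when lower <= mean <= upper, otherwise the squared
    distance from the mean to the nearest bound. This form is an
    illustration, not the loss defined in the paper.
    """
    below = max(lower - mean, 0.0)   # violation of the lower bound
    above = max(mean - upper, 0.0)   # violation of the upper bound
    return below ** 2 + above ** 2


def combined_loss(actor_loss, critic_loss, mean, lower, upper, c1=0.5, c2=1.0):
    """L = L_actor + C1 * L_critic + C2 * L_supervising.

    c1 and c2 play the role of C_1 and C_2; their defaults here are
    illustrative values, not taken from the paper.
    """
    return actor_loss + c1 * critic_loss + c2 * supervising_loss(mean, lower, upper)


# A mean inside the interval adds no supervising penalty;
# a mean outside it is pushed back toward the nearest bound.
inside = combined_loss(1.0, 2.0, mean=0.0, lower=-1.0, upper=1.0)
outside = combined_loss(1.0, 2.0, mean=2.0, lower=-1.0, upper=1.0)
```

In a real training loop the same arithmetic would be applied to batched tensors so that gradients flow from the penalty into the policy mean.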
Abstract
Energy storage devices, such as batteries, thermal energy storage, and hydrogen systems, can help mitigate climate change by ensuring a more stable and sustainable power supply. To maximize the effectiveness of such energy storage, it is crucial to determine the appropriate charging and discharging amounts for each time period. Reinforcement learning is preferred over traditional optimization for energy storage control because it can adapt to dynamic and complex environments. However, charging and discharging levels are continuous, which limits discrete reinforcement learning, and the feasible charge-discharge range varies over time with the state of charge (SoC), which limits conventional continuous reinforcement learning. In this paper, we propose a continuous reinforcement learning approach that takes the time-varying feasible charge-discharge range into account. In addition to the objectives for training the actor (policy learning) and the critic (value learning), we introduce an objective function for learning the feasible action range at each time period. By forcing the charging and discharging levels into the feasible action range, this objective keeps the energy storage from getting stuck in suboptimal states, such as remaining fully charged or fully discharged, and so actively promotes its utilization. Experimental results show that the proposed method increases the effectiveness of energy storage by actively enhancing its utilization.
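To make the idea of a time-varying feasible charge-discharge range concrete, the sketch below computes how much energy could be charged or discharged in one step for a given SoC. It is a minimal illustration under stated assumptions, not the paper's model: only the 100 MWh capacity comes from the summary, while the SoC bounds (0.1 and 0.9), the 25 MW power limit, the one-hour step, and lossless conversion are hypothetical values, as is the function name.

```python
def feasible_range(soc, capacity_mwh=100.0, soc_min=0.1, soc_max=0.9,
                   max_power_mw=25.0, dt_h=1.0):
    """Return (max_charge_mwh, max_discharge_mwh) for one time step.

    Assumes lossless charging and discharging. All defaults except the
    100 MWh capacity are illustrative assumptions.
    """
    energy = soc * capacity_mwh                      # energy currently stored
    headroom = soc_max * capacity_mwh - energy       # room left before the upper SoC bound
    available = energy - soc_min * capacity_mwh      # energy above the lower SoC bound
    step_limit = max_power_mw * dt_h                 # energy limit from the power rating
    max_charge = max(0.0, min(step_limit, headroom))
    max_discharge = max(0.0, min(step_limit, available))
    return max_charge, max_discharge


# The range shrinks on one side as the SoC approaches a bound.
mid_charge, mid_discharge = feasible_range(0.5)
full_charge, full_discharge = feasible_range(0.9)
low_charge, low_discharge = feasible_range(0.15)
```

At the upper SoC bound the feasible charge collapses to zero, which is the kind of dead state a policy can get stuck in if it keeps proposing charging actions.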
