Control Policy Correction Framework for Reinforcement Learning-based Energy Arbitrage Strategies
Seyed Soroush Karimi Madahi, Gargya Gokhale, Marie-Sophie Verwee, Bert Claessens, Chris Develder
TL;DR
This paper addresses safe, revenue-maximizing energy arbitrage for batteries under a single imbalance pricing scheme. It formulates the problem as a Markov decision process and employs distributional deep Q-learning, which learns the full return distribution rather than only its expectation and thereby supports risk-sensitive decisions. A novel post-processing policy correction step, built on a differentiable optimization layer and implemented through knowledge distillation, enforces human-intuitive safety constraints without retraining from scratch, improving both interpretability and robustness. The framework is validated on Belgian imbalance prices from 2023 and demonstrated on a real HomeLab battery, showing profit gains and practical deployability, albeit with some real-world latency and data-delivery limitations. Because the correction is applied after training, the approach offers a reusable, safety-oriented enhancement that can be adapted to other RL methods and settings.
Abstract
A continuous rise in the penetration of renewable energy sources, along with the use of single imbalance pricing, provides a new opportunity for balance responsible parties to reduce their costs through energy arbitrage in the imbalance settlement mechanism. Model-free reinforcement learning (RL) methods are an appropriate choice for solving the energy arbitrage problem due to their outstanding performance in solving complex stochastic sequential problems. However, RL is rarely deployed in real-world applications because its learned policy does not necessarily guarantee safety during the execution phase. In this paper, we propose a new RL-based control framework for batteries to obtain a safe energy arbitrage strategy in the imbalance settlement mechanism. In our proposed control framework, the agent initially aims to optimize the arbitrage revenue. Subsequently, in a post-processing step, we correct (constrain) the learned policy through a knowledge distillation process based on properties that follow human intuition. Our post-processing step is a generic method and is not restricted to the energy arbitrage domain. We use the Belgian imbalance prices of 2023 to evaluate the performance of our proposed framework. Furthermore, we deploy our proposed control framework on a real battery to demonstrate its capability in the real world.
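To make the post-processing idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes one hypothetical intuitive property (the battery's discharge power should never decrease as the imbalance price rises) and a hypothetical sign convention (positive action means discharging). In place of a learned student network or a differentiable optimization layer, it uses a plain least-squares projection onto monotone policies (pool-adjacent-violators), so the "student" is simply the closest policy to the "teacher" that satisfies the property. The function name and the example numbers are illustrative only.

```python
import numpy as np


def project_monotone(teacher_actions):
    """Return the non-decreasing sequence closest (in least squares) to the input.

    Assumption: teacher_actions[i] is the learned policy's action at the i-th
    price level, with prices sorted in increasing order.
    """
    blocks = []  # each block stores [sum of actions, number of points]
    for a in teacher_actions:
        blocks.append([float(a), 1])
        # Merge neighbouring blocks while they violate the ordering constraint.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    # Every point in a block receives the block mean.
    return np.concatenate([np.full(c, s / c) for s, c in blocks])


# Hypothetical teacher policy on six increasing price levels
# (-1 = full charge, +1 = full discharge); the dip at index 3 violates
# the assumed property.
teacher = np.array([-1.0, -1.0, 0.5, -0.2, 1.0, 1.0])
student = project_monotone(teacher)
```

In this toy setting the corrected policy changes the teacher only where the property is violated, which mirrors the goal stated above of constraining the learned policy without retraining it from scratch.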
