Economic Battery Storage Dispatch with Deep Reinforcement Learning from Rule-Based Demonstrations

Manuel Sage, Martin Staniszewski, Yaoyao Fiona Zhao

TL;DR

The paper tackles economic battery dispatch under price uncertainty, where long horizons and delayed rewards make learning difficult. It extends Soft Actor-Critic (SAC) with learning from demonstrations (SACfD): cheaply generated, rule-based demonstrations are stored in a second replay buffer and sampled with linearly decaying probability alongside the agent's own experiences. In a case study on a grid-connected microgrid, SACfD substantially improves sample efficiency and final rewards over the baselines and outperforms the demonstrator across a range of rule qualities. The findings highlight the practical potential of imperfect, rule-based demonstrations for data-efficient RL in energy storage dispatch when expert data is unavailable.
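The two-buffer sampling idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the per-transition (rather than per-batch) choice of buffer, and the start and end probabilities of 1.0 and 0.0 are all assumptions made for illustration.

```python
import random


def demo_probability(step, decay_steps, p_start=1.0, p_end=0.0):
    """Probability of drawing from the demonstration buffer at a given step.

    Interpolates linearly from p_start to p_end over decay_steps, then
    holds p_end. p_start and p_end are assumed values, not from the paper.
    """
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def sample_batch(demo_buffer, agent_buffer, batch_size, step, decay_steps, rng=random):
    """Build a mixed batch from the two buffers (hypothetical helper).

    Each transition comes from the demonstration buffer with probability
    demo_probability(step, decay_steps), otherwise from the agent's buffer.
    Falls back to demonstrations while the agent buffer is still empty.
    """
    p_demo = demo_probability(step, decay_steps)
    batch = []
    for _ in range(batch_size):
        use_demo = rng.random() < p_demo or not agent_buffer
        source = demo_buffer if use_demo else agent_buffer
        batch.append(rng.choice(source))
    return batch
```

Under these assumptions, early batches consist almost entirely of demonstrations, and the agent's own experience takes over once `step` reaches `decay_steps`. The sketch expects a non-empty demonstration buffer.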

Abstract

The application of deep reinforcement learning algorithms to economic battery dispatch problems has significantly increased recently. However, optimizing battery dispatch over long horizons can be challenging due to delayed rewards. In our experiments, we observe poor performance of popular actor-critic algorithms when trained on yearly episodes with hourly resolution. To address this, we propose an approach extending soft actor-critic (SAC) with learning from demonstrations. The special feature of our approach is that, due to the absence of expert demonstrations, the demonstration data is generated through simple, rule-based policies. We conduct a case study on a grid-connected microgrid and use if-then-else statements based on the wholesale price of electricity to collect demonstrations. These are stored in a separate replay buffer and sampled with linearly decaying probability along with the agent's own experiences. Despite these minimal modifications and the imperfections in the demonstration data, the results show a drastic performance improvement regarding both sample efficiency and final rewards. We further show that the proposed method reliably outperforms the demonstrator and is robust to the choice of rule, as long as the rule is sufficient to guide early training in the right direction.
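A price-based if-then-else rule of the kind described above might look like the following sketch. The two-threshold structure, the threshold values, the normalized action range, and the function names are hypothetical; the abstract only states that the rules are if-then-else statements on the wholesale electricity price.

```python
def rule_based_action(price, low_threshold, high_threshold):
    """Return a normalized battery action in [-1, 1] from the current price.

    +1.0 charges at full power, -1.0 discharges at full power, 0.0 idles.
    The thresholds are assumed parameters, not values from the paper.
    """
    if price <= low_threshold:
        return 1.0   # electricity is cheap: charge
    elif price >= high_threshold:
        return -1.0  # electricity is expensive: discharge
    else:
        return 0.0   # price in between: do nothing


def collect_demonstrations(prices, low_threshold, high_threshold):
    """Pair each hourly price with the rule's action (simplified).

    A full demonstration would store (state, action, reward, next_state)
    transitions obtained by running the rule in the environment.
    """
    return [(p, rule_based_action(p, low_threshold, high_threshold)) for p in prices]
```

For instance, `collect_demonstrations([12.0, 35.0, 80.0], low_threshold=20.0, high_threshold=60.0)` labels the three hours as charge, idle, and discharge. Such a rule ignores the state of charge and future prices, which is one reason the resulting demonstrations are imperfect.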

Paper Structure

This paper contains 7 sections, 10 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Simplified model of the grid-connected microgrid.
  • Figure 2: Demand and RE power production during a random week in 2018.
  • Figure 3: Results of the conducted experiments. All experiments were repeated over five independent runs. The shaded area corresponds to ± one standard deviation. (LEFT) Accumulated episodic reward of SACfD compared with three benchmark algorithms over long training periods of 200 episodes. (MIDDLE) SACfD's linearly decaying sampling ratio between demonstration and experience buffer compared with exponential decay rates. (RIGHT) Performance of rules with different decision thresholds contrasted with the performance of SACfD when trained on demonstration data from these rules.