Should we use model-free or model-based control? A case study of battery management systems

Mohamad Fares El Hajj Chehade; Young-ho Cho; Sandeep Chinchali; Hao Zhu

TL;DR

The paper benchmarks model-free reinforcement learning (RL) head-to-head against model-based model predictive control (MPC) for a battery management system (BMS) that minimizes electricity cost subject to state-of-charge and power constraints. The MPC approach uses an LSTM forecaster to generate multi-step predictions and solves a horizon-$T$ linear program, while RL is cast as a Markov decision process and optimized with proximal policy optimization (PPO) using a policy network and a value function. Across two datasets, RL comes closer to the optimal cost, runs faster at test time, and is robust to distributional shifts in demand, at the cost of substantial training data; MPC is more data-efficient, but its performance degrades with forecaster errors. The results give practical guidance on when to favor RL or MPC in BMS applications, and the paper proposes a benchmark framework for evaluating optimality, data efficiency, computation time, and robustness in energy-system control problems.

Abstract

Reinforcement learning (RL) and model predictive control (MPC) each offer distinct advantages and limitations when applied to control problems in power and energy systems. Despite various studies on these methods, benchmarks remain lacking and the preference for RL over traditional controls is not well understood. In this work, we put forth a comparative analysis using RL- and MPC-based controllers for optimizing a battery management system (BMS). The BMS problem aims to minimize costs while adhering to operational limits by adjusting the battery (dis)charging in response to fluctuating electricity prices over a time horizon. The MPC controller uses a learning-based forecast of future demand and price changes to formulate a multi-period linear program that can be solved using off-the-shelf solvers. Meanwhile, the RL controller requires no time-series modeling but instead is trained from the sample trajectories using the proximal policy optimization (PPO) algorithm. Numerical tests compare these controllers across optimality, training time, testing time, and robustness, providing a comprehensive evaluation of their efficacy. RL not only yields optimal solutions quickly but also ensures robustness to shifts in customer behavior, such as changes in demand distribution. However, as expected, training the RL agent is more time-consuming than MPC.
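To make the multi-period linear program concrete, the following is a minimal sketch and not the authors' formulation. It assumes a lossless battery with unit time steps, a grid purchase of $d_t + a_t$ at each step, no export back to the grid, and a state of charge that evolves as $s_{t+1} = s_t + a_t$. The function name `solve_battery_lp` and the parameters `soc0`, `soc_max`, and `p_max` are hypothetical. In an MPC loop, the `price` and `demand` arrays would be the forecaster's predictions, and only the first action would typically be applied before re-solving.

```python
import numpy as np
from scipy.optimize import linprog


def solve_battery_lp(price, demand, soc0, soc_max, p_max):
    """Solve a horizon-T battery scheduling LP (illustrative assumptions only).

    Decision variables: a_t (charging > 0, discharging < 0), t = 0..T-1.
    Assumed cost: sum_t price_t * (demand_t + a_t).
    Assumed dynamics: soc_{t+1} = soc_t + a_t (lossless, unit time step).
    """
    price = np.asarray(price, dtype=float)
    demand = np.asarray(demand, dtype=float)
    T = len(price)

    # Lower-triangular matrix of ones: (L @ a)[t] is the cumulative charge
    # added up to and including step t.
    L = np.tril(np.ones((T, T)))

    # State-of-charge limits: 0 <= soc0 + L @ a <= soc_max.
    A_ub = np.vstack([L, -L])
    b_ub = np.concatenate([np.full(T, soc_max - soc0), np.full(T, soc0)])

    # Power limits, plus a_t >= -demand_t so the grid purchase stays
    # non-negative (an assumption: no selling back to the grid).
    bounds = [(max(-p_max, -d), p_max) for d in demand]

    # price @ demand is constant, so minimizing price @ a is equivalent.
    res = linprog(c=price, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    actions = res.x
    cost = float(price @ (demand + actions))
    return actions, cost


if __name__ == "__main__":
    # Cheap electricity first, expensive electricity second.
    actions, cost = solve_battery_lp(
        price=[1.0, 5.0], demand=[2.0, 2.0], soc0=0.0, soc_max=1.0, p_max=1.0
    )
    print("actions:", actions, "cost:", cost)
```

In this toy instance the solver charges while the price is low and discharges while it is high, which is the behavior the cost-minimization objective is meant to induce.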

Paper Structure (9 sections, 6 equations, 3 figures, 3 tables, 1 algorithm)

This paper contains 9 sections, 6 equations, 3 figures, 3 tables, 1 algorithm.

Introduction
Preliminaries
Model-Predictive Control (MPC)
Reinforcement Learning (RL)
Methodology
Numerical Comparisons
Dataset 1: Consistent Model through Training and Testing Sets
Dataset 2: Distributional Shifts in the Demand from the Training to the Testing Sets
Conclusion

Figures (3)

Figure 1: The components of the system. At a given time $t$, the load has a demand $d_t$ and the utility sells electricity at a price $\rho_t$. To minimize the total cost of purchasing electricity over a certain period, the battery controller chooses between charging from the grid ($a_t > 0$) and discharging to the load ($a_t <0$).
Figure 2: The policy network $\pi_{\theta}$. The state vector $s_t$ propagates through the neural network layers to output the probability masses $\pi_{\theta}(a|s_t) \: \forall a \in \mathcal{A}$. $\sigma$ is a non-linear activation function to allow the network to represent non-linear input-output relationships. A softmax function is applied at the output to convert the raw numbers (logits) to probabilities.
Figure 3: The shift in the distribution of the demand from the training to the testing datasets.
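The forward pass described in the Figure 2 caption can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's architecture: the layer sizes, the choice of `tanh` for $\sigma$, the random initialization, and the helper names `init_params` and `policy_forward` are all hypothetical. It only shows how a state vector is mapped through hidden layers to logits and then to a probability mass over a discrete action set via softmax.

```python
import numpy as np


def softmax(logits):
    """Convert raw logits to probabilities (shifted for numerical stability)."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def init_params(sizes, seed=0):
    """Create (weight, bias) pairs for consecutive layer sizes (hypothetical init)."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def policy_forward(state, params):
    """Return pi_theta(a | s_t) for every action a in a discrete action set."""
    h = np.asarray(state, dtype=float)
    # Hidden layers: affine map followed by the non-linear activation sigma
    # (tanh here, as an assumption).
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)
    # Output layer: logits, then softmax to obtain probability masses.
    W, b = params[-1]
    return softmax(W @ h + b)


if __name__ == "__main__":
    # Hypothetical sizes: 4 state features, two hidden layers, 5 discrete actions.
    params = init_params([4, 16, 16, 5])
    probs = policy_forward([0.5, 0.2, 1.0, 0.3], params)
    action = np.random.default_rng(1).choice(len(probs), p=probs)
    print("probabilities:", probs, "sampled action index:", action)
```

Sampling from the output distribution, rather than taking the most likely action, is what lets a stochastic policy explore during training.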
