Learning Reachability of Energy Storage Arbitrage
Tomás Tapia, Agustin Castellano, Enrique Mallada, Yury Dvorkin
TL;DR
The paper tackles the intertemporal reliability risk in energy storage arbitrage under uncertain prices by introducing stopping-time rewards and terminal SoC penalties that steer storage toward reserve readiness during critical hours. It develops three modeling approaches: a scenario-based Sample Average Approximation (SAA), a Deep Q-Learning (DQN) framework, and an End-to-End (E2E) framework that jointly trains price predictors and a dispatch policy under calibrated uncertainty sets. The key contributions are the integration of a stopping-time criterion with a terminal SoC constraint, a comparative evaluation showing that E2E achieves robust performance with low profit variance, and a demonstration of how reliability objectives can be embedded directly into learning-based dispatch. The findings point to a practical way of enforcing reliability in storage operations while maintaining market compatibility, particularly through conformal calibration and end-to-end task-driven training. Overall, the work advances reliability-aware energy arbitrage by combining reachability analysis, risk quantification, and learning-based decision making.
Abstract
Power systems face increasing weather-driven variability and therefore rely more and more on flexible but energy-limited storage resources. Energy storage can buffer this variability, but its value depends on intertemporal decisions under uncertain prices. Without accounting for the future reliability value of stored energy, batteries may act myopically, discharging too early or failing to preserve reserves during critical hours. This paper introduces a stopping-time reward that, together with a penalty on deviations from a target state-of-charge (SoC) range, aligns arbitrage incentives with system reliability by rewarding storage that maintains sufficient SoC before critical hours. We formulate the problem as an online optimization with a chance-constrained terminal SoC and embed it in an end-to-end (E2E) learning framework, jointly training the price predictor and control policy. The proposed design enhances reachability of target SoC ranges, improves profit under volatile conditions, and reduces the standard deviation of profit.
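To make the idea of a chance-constrained terminal SoC concrete, the sketch below estimates how often a storage unit reaches a target SoC range across sampled price scenarios. It is an illustration only and not the paper's formulation: the threshold dispatch rule, the linear range penalty, and every name and default value (`simulate_soc`, `terminal_soc_report`, `buy_below`, `sell_above`, `eps`, `penalty`) are assumptions introduced here.

```python
import numpy as np


def simulate_soc(prices, soc0=0.5, power=0.25, eta=0.9,
                 buy_below=30.0, sell_above=60.0):
    """Roll a hypothetical threshold policy forward over one price path.

    SoC is a fraction of capacity in [0, 1]; `power` is the largest
    fraction of capacity that can be moved in one step; `eta` is a
    one-way efficiency applied to both charging and discharging.
    Returns (arbitrage profit, terminal SoC).
    """
    soc, profit = soc0, 0.0
    for p in prices:
        if p < buy_below:
            # Energy bought from the grid, limited by remaining headroom.
            charge = min(power, (1.0 - soc) / eta)
            soc += eta * charge
            profit -= p * charge
        elif p > sell_above:
            # Energy taken out of storage, limited by what is stored.
            discharge = min(power, soc)
            soc -= discharge
            profit += p * eta * discharge
    return profit, soc


def terminal_soc_report(scenarios, soc_range=(0.6, 1.0), eps=0.1,
                        penalty=100.0, **policy):
    """Empirical check of P(terminal SoC in soc_range) >= 1 - eps."""
    results = [simulate_soc(s, **policy) for s in scenarios]
    profits = np.array([r[0] for r in results])
    socs = np.array([r[1] for r in results])
    lo, hi = soc_range
    # Distance of the terminal SoC from the target range (0 if inside).
    miss = np.maximum(lo - socs, 0.0) + np.maximum(socs - hi, 0.0)
    reach_prob = float(np.mean(miss == 0.0))
    return {
        "reach_prob": reach_prob,
        "chance_constraint_ok": reach_prob >= 1.0 - eps,
        "mean_penalized_profit": float(np.mean(profits - penalty * miss)),
        "profit_std": float(np.std(profits)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic price paths (assumed values, not market data).
    scenarios = rng.normal(45.0, 20.0, size=(500, 12))
    print(terminal_soc_report(scenarios))
```

A myopic rule like this one sells whenever prices are high, so it can arrive at the critical hour with too little stored energy. The reported reach probability and penalized profit make that trade-off visible, which is the gap a stopping-time reward and terminal SoC penalty are meant to close.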
