Assessing Long-Term Electricity Market Design for Ambitious Decarbonization Targets using Multi-Agent Reinforcement Learning

Javier Gonzalez-Ruiz, Carlos Rodriguez-Pardo, Iacopo Savelli, Alice Di Bella, Massimo Tavoni

TL;DR

The paper develops a scalable, open-source multi-agent reinforcement learning framework to evaluate long-term electricity market designs under ambitious decarbonization targets. It uses Independent Proximal Policy Optimization (IPPO) to train profit-maximizing Generation Companies (GENCOs) across three mutually exclusive investment channels (Merchant, CfD, Capacity Market) within a stylized Italian system, incorporating representative days and a copper-plate grid. The study demonstrates how different market designs and policy instruments shape investment, emissions, and price outcomes, highlighting the critical role of long-term market design in enabling decarbonization while mitigating price volatility. It also discusses hyperparameter strategies, limitations of independent learning in MARL, and avenues for extending the framework to include more constraints, risk preferences, and regulator-style agents for policy analysis. Overall, the framework provides a flexible tool for policymakers to stress-test hybrid market designs and long-term incentives in the energy transition landscape.

Abstract

Electricity systems are key to transforming today's society into a carbon-free economy. Long-term electricity market mechanisms, including auctions, support schemes, and other policy instruments, are critical in shaping the electricity generation mix. In light of the need for more advanced tools to support policymakers and other stakeholders in designing, testing, and evaluating long-term markets, this work presents a multi-agent reinforcement learning model capable of capturing the key features of decarbonizing energy systems. Profit-maximizing generation companies make investment decisions in the wholesale electricity market, responding to system needs, competitive dynamics, and policy signals. The model employs independent proximal policy optimization, selected for its suitability to the decentralized and competitive environment. Nevertheless, given the inherent challenges of independent learning in multi-agent settings, an extensive hyperparameter search ensures that decentralized training yields market outcomes consistent with competitive behavior. The model is applied to a stylized version of the Italian electricity system and tested under varying levels of competition, market designs, and policy scenarios. Results highlight the critical role of market design in decarbonizing the electricity sector and avoiding price volatility. The proposed framework allows the assessment of long-term electricity markets in which multiple policy and market mechanisms interact simultaneously, with market participants responding and adapting to decarbonization pathways.

Paper Structure

This paper contains 45 sections, 25 equations, 33 figures, 14 tables.

Figures (33)

  • Figure 1: Structure of the long-term electricity market in the reinforcement learning model. The diagram highlights the interactions between GENCO agents and the market environment. From the market, agents receive observations (public information, published by entities such as Market Operators and TSOs, and private information about the performance of their portfolio) and take actions (investment decisions) in the system, with the aim of maximizing their profits during the simulation. The figure illustrates one year of market operation, in which agents participate in the short-term market at each environment step while making investment decisions annually. Investment decisions occur sequentially through mutually exclusive entry mechanisms (merchant, Capacity Market, CfD market), with the latter two depending on system balance calculations, as detailed in the section "Long-term Electricity Market Environment".
  • Figure 2: Evolution of aggregated and individual reward during training for hyperparameter configurations M7 (MLP network) and L1 (LSTM network). The upper graphs present results for the EoM environment, while the lower ones show the Capacity plus CfD market. The left panels show the aggregate reward for the system. Panels in the second to fourth columns present the reward evolution during training for Agent 1 (Incumbent), Agent 8 (Entrant - Solar PV), and Agent 15 (mid-term storage operation), respectively. Results are normalized according to the relative wall time used for training. Solid curves indicate average values obtained during sampling, while shaded areas represent minimum and maximum values.
  • Figure 3: Average Aggregated Reward for different hyperparameter configurations in the EoM and CM + CfD environments. Results are obtained using 100 episodes in the environment with the most recently updated versions of the agents. Hatched bars indicate the outcome for the EoM environment, while solid bars represent the CM + CfD environment. Error bars show the 10th and 90th percentiles from the 100 episodes. The y-axes are adjusted to facilitate comparison among the most relevant hyperparameter configurations.
  • Figure 4: Average Penalty, HHI index, and League Ranking for hyperparameter configurations in the EoM (A) and CM + CfD (B) environments. Penalty and HHI index results are obtained using 100 episodes in the environment with the most recently updated versions of the agents. League Ranking is obtained from the competition set between all hyperparameter configurations per environment, where agents rank according to their overall performance between 1 (best) and 26 (worst). To facilitate visualization in the polar plot, all metrics undergo zero-centered median normalization, are clipped between 0 and 2, and are shifted by one unit.
  • Figure 5: Average Penalty, HHI index, and League Ranking for hyperparameter configurations from the ablation study in the EoM (A) and CM + CfD (B) environments. Penalty and HHI index results are obtained using 100 episodes in the environment with the most recently updated versions of the agents. League Score is obtained from the competition set between the hyperparameter configurations from the ablation study per environment, where agents are scored according to their overall performance between 0 (best) and 1 (worst). To facilitate visualization in the polar plot, all metrics undergo zero-centered mean normalization, are clipped between 0 and 2, and are shifted by one unit.
  • ...and 28 more figures
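
The captions for Figures 4 and 5 refer to an HHI index and to a normalization applied before plotting. The sketch below shows one way these two quantities could be computed; it is not taken from the paper's code. The function names `hhi` and `normalize_for_polar` are hypothetical. The HHI uses fractional shares (so it ranges from 1/n to 1), and the basis for the shares (capacity here) is an assumption. The captions do not give the scaling or the exact order of operations, so the sketch assumes this reading: centre on the median, divide by its absolute value, shift by one unit, then clip to [0, 2].

```python
from statistics import median


def hhi(capacities):
    """Herfindahl-Hirschman Index as the sum of squared fractional shares.

    Assumption: shares are computed from each agent's capacity.
    """
    total = sum(capacities)
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return sum((c / total) ** 2 for c in capacities)


def normalize_for_polar(values):
    """Assumed reading of the caption, not the authors' confirmed procedure:
    centre on the median, scale by |median|, shift by +1, clip to [0, 2].
    """
    med = median(values)
    scale = abs(med) if med != 0 else 1.0  # avoid division by zero
    return [min(2.0, max(0.0, (v - med) / scale + 1.0)) for v in values]


if __name__ == "__main__":
    # Four equally sized agents: the least concentrated case for n = 4.
    print(hhi([1, 1, 1, 1]))
    # The median value maps to 1; large outliers are clipped at 2.
    print(normalize_for_polar([1.0, 2.0, 10.0]))
```

Under this reading, a configuration at the median of a metric sits at 1 on the polar axis, and clipping keeps a single outlier from dominating the plot. For Figure 5, which mentions mean normalization, the same sketch would apply with the mean in place of the median.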