Centrally Coordinated Multi-Agent Reinforcement Learning for Power Grid Topology Control

Barbera de Mol, Davide Barbieri, Jan Viebahn, Davide Grossi

TL;DR

The paper tackles the combinatorial challenge of power grid topology control under rising renewable integration by introducing a centrally coordinated multi-agent (CCMA) reinforcement learning framework that factorizes the action space. Regional agents propose topological changes and a coordinating agent selects the final action, which keeps learning scalable and sample-efficient as networks grow. Across 5- and 14-bus grids (with and without stochastic outages) and preliminary 36-bus experiments, CCMA variants outperform baselines in both learning efficiency and final performance, with Greedy-RL often delivering the best results. The work demonstrates the practical potential of combining hierarchical reinforcement learning (HRL) and multi-agent reinforcement learning (MARL) for robust, scalable power system control, while outlining concrete avenues for scaling to larger networks and incorporating imitation learning and partial observability.

Abstract

Power grid operation is becoming more complex due to the increase in generation of renewable energy. The recent series of Learning To Run a Power Network (L2RPN) competitions has encouraged the use of artificial agents to assist human dispatchers in operating power grids. However, the combinatorial nature of the action space poses a challenge to both conventional optimizers and learned controllers. Action space factorization, which breaks down decision-making into smaller sub-tasks, is one approach to tackle the curse of dimensionality. In this study, we propose a centrally coordinated multi-agent (CCMA) architecture for action space factorization. In this approach, regional agents propose actions and subsequently a coordinating agent selects the final action. We investigate several implementations of the CCMA architecture, and benchmark in different experimental settings against various L2RPN baseline approaches. The CCMA architecture exhibits higher sample efficiency and better final performance than the baseline approaches. The results suggest high potential of the CCMA approach for further application in higher-dimensional L2RPN as well as real-world power grid settings.
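
The propose-then-select flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Proposal` type, the action strings, the `predicted_max_loading` field, the threshold value of 0.95 and the rule-based `lowest_loading_coordinator` are all hypothetical stand-ins chosen for illustration. In the paper both the regional agents and the coordinator may be learned policies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Proposal:
    """One regional agent's suggestion (hypothetical structure)."""
    region: str
    action: Optional[str]          # None encodes "do nothing"
    predicted_max_loading: float   # assumed score: lower is better


RegionalAgent = Callable[[Dict[str, float]], Proposal]
Coordinator = Callable[[List[Proposal]], Proposal]


def lowest_loading_coordinator(proposals: List[Proposal]) -> Proposal:
    """Stand-in coordinator: pick the proposal with the lowest predicted loading."""
    return min(proposals, key=lambda p: p.predicted_max_loading)


def ccma_step(
    line_loadings: Dict[str, float],
    regional_agents: List[RegionalAgent],
    coordinator: Coordinator,
    threshold: float = 0.95,  # assumed value, not taken from the paper
) -> Optional[str]:
    """Return the selected action, or None when the grid is left untouched."""
    # Gate: only call the agents when the most loaded line exceeds the threshold.
    if max(line_loadings.values()) < threshold:
        return None
    # Every regional agent proposes one action for its own region.
    proposals = [agent(line_loadings) for agent in regional_agents]
    # The coordinator selects exactly one of the proposed actions.
    return coordinator(proposals).action


def make_fixed_agent(region: str, action: Optional[str], predicted: float) -> RegionalAgent:
    """Toy regional agent that always returns the same proposal."""
    def agent(_observation: Dict[str, float]) -> Proposal:
        return Proposal(region, action, predicted)
    return agent


agents = [
    make_fixed_agent("sub_1", "reconfigure_sub_1_a", 0.90),
    make_fixed_agent("sub_2", None, 1.05),
]
calm = {"line_0": 0.40, "line_1": 0.60}
stressed = {"line_0": 0.40, "line_1": 1.05}

print(ccma_step(calm, agents, lowest_loading_coordinator))      # None
print(ccma_step(stressed, agents, lowest_loading_coordinator))  # reconfigure_sub_1_a
```

The point of the factorization is visible in the function signatures: each regional agent only chooses among the configurations of its own region, and the coordinator only chooses among as many options as there are regions, rather than one policy choosing from the full combinatorial action space.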

Paper Structure

This paper contains 41 sections, 6 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Diagram of the Feedback Control Framework used in this research. The goal to be achieved (and maintained) in the system is an input to the entire framework. The comparator determines the difference between the current state of the system and the desired one. There are then two possible paths for the control flow. If a goal state is reached, then no action is performed and the reward is accumulated. If the system is not in the goal state, then an agent is called. Periodically this agent can be updated by a trainer. The dashed line represents the only link between the two loops, as the trainer will perceive the accumulated rewards of consecutive time-steps where the do-nothing action was used for control.
  • Figure 2: Left: Hierarchy proposed by manczak2023hierarchical and van2023multi. First a region (e.g., a substation) is selected, and then a topological configuration for that region (and that region only) is selected. Right: Hierarchy proposed in this study. For each region a topological reconfiguration is proposed concurrently, and a coordinator then selects the best proposed action (e.g., a substation configuration).
  • Figure 3: An overview of all possible multi-agent architectures. First, the gate compares the maximum line loading in the current observation OBS to a threshold to determine whether to do nothing (upper path) or to try to reconfigure the topology of the grid (lower path). In the latter case, all the REGIONAL AGENTS first propose an action to reconfigure their respective region (or do nothing). This list of actions is then passed to the COORDINATING AGENT, which selects one of the regions (i.e., one of the actions proposed by the regional agents). The final result is always an action in the topological reconfiguration space.
  • Figure 4: The mean number of timesteps survived on all validation scenarios throughout training for the 5-bus network without opponent (top left), the 5-bus network with opponent (top right), the 14-bus network without opponent (bottom left) and the 14-bus network with opponent (bottom right). For the mean and standard deviation, seeds are interpolated between 0 and max_timesteps to align them on a common time axis.
  • Figure 5: The mean performance of the Single RL Agent compared to the individual seeds of the Greedy-RL and RL-RL agents on the 14-bus environment with opponent. For the mean and standard deviation, seeds are interpolated between 0 and max_timesteps to align them on a common time axis.
  • ...and 1 more figure
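
The captions of Figures 4 and 5 mention that seeds are interpolated between 0 and max_timesteps before the mean and standard deviation are computed. A minimal sketch of one way to do this is given below, assuming linear interpolation with NumPy. The function name `align_seeds`, the grid resolution `num_points` and the toy data are hypothetical, and the captions do not state which interpolation scheme the authors used.

```python
import numpy as np


def align_seeds(runs, max_timesteps, num_points=100):
    """Resample each seed's curve onto a shared grid, then aggregate.

    runs: list of (timesteps, values) pairs, one per seed; the timesteps
          must be increasing but may differ between seeds.
    Returns the common grid plus the per-point mean and standard deviation.
    """
    grid = np.linspace(0, max_timesteps, num_points)
    aligned = np.stack([
        # np.interp interpolates linearly and holds the endpoint values
        # for grid points outside a seed's logged range.
        np.interp(grid, np.asarray(t, dtype=float), np.asarray(v, dtype=float))
        for t, v in runs
    ])
    # Population standard deviation (ddof=0) across seeds.
    return grid, aligned.mean(axis=0), aligned.std(axis=0)


# Two toy seeds that were evaluated at different training timesteps.
runs = [
    ([0, 50, 100], [0.0, 10.0, 20.0]),
    ([0, 100], [0.0, 40.0]),
]
grid, mean, std = align_seeds(runs, max_timesteps=100, num_points=3)
print(grid)  # [  0.  50. 100.]
print(mean)  # [ 0. 15. 30.]
print(std)   # [ 0.  5. 10.]
```

Resampling onto a shared grid is needed because independent training runs rarely log validation results at identical timesteps, so averaging the raw logs point by point would mix values from different stages of training.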