Scalable Multi-Agent Reinforcement Learning for Residential Load Scheduling under Data Governance

Zhaoming Qin, Nanqing Dong, Di Liu, Zhefan Wang, Junwei Cao

TL;DR

This work addresses privacy and scalability challenges in multi-agent reinforcement learning for cooperative residential load scheduling under data governance. It introduces DADC (decentralized actors with distributed critics), where each household runs an on-device actor and a local critic that outputs a scalar value, which is sent to the cloud to compute a global value function through a lightweight feed-forward network. By decoupling value estimation into scalar local components and a central aggregator, DADC preserves household privacy, reduces cloud communication, and achieves linear scalability in the number of households, while maintaining competitive performance relative to privacy-unconstrained baselines such as DACC. Empirical results on real-world data show that DADC outperforms independent actor-critic (IAC) and approaches DACC in performance, with significant gains in implicit credit assignment and substantial reductions in communication and computation overhead, enabling practical cloud-edge deployment.
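
The linear-scalability claim can be illustrated with a back-of-the-envelope count of what each training step sends upstream. This is a rough sketch, not a measurement from the paper: it assumes one float per household for the scalar-critic scheme and, for contrast, a hypothetical alternative in which every household ships its raw observation vector to a central critic. The function name `upstream_floats_per_step` and the observation size `obs_dim = 20` are made up for illustration.

```python
def upstream_floats_per_step(n_households, floats_per_household):
    """Count the floats sent from the edge to the cloud in one step."""
    return n_households * floats_per_household

obs_dim = 20  # hypothetical local observation size, not taken from the paper

for n in (10, 100, 1000):
    scalar_scheme = upstream_floats_per_step(n, 1)        # one scalar value per household
    raw_obs_scheme = upstream_floats_per_step(n, obs_dim)  # assumed raw-observation upload
    print(n, scalar_scheme, raw_obs_scheme)
```

Under these assumptions both counts grow linearly in the number of households, but the scalar scheme is smaller by a constant factor equal to the observation size, and no raw household observation leaves the device.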

Abstract

As a data-driven approach, multi-agent reinforcement learning (MARL) has made remarkable advances in solving cooperative residential load scheduling problems. However, centralized training, the most common paradigm for MARL, limits large-scale deployment in communication-constrained cloud-edge environments. As a remedy, distributed training shows clear advantages in real-world applications but still faces challenges in system scalability, e.g., the high communication overhead incurred when coordinating individual agents, and must comply with data governance requirements on privacy. In this work, we propose a novel MARL solution to address these two practical issues. Our proposed approach is based on actor-critic methods, where the global critic is a learned function of individual critics computed solely from the local observations of households. This scheme preserves household privacy completely and significantly reduces communication cost. Simulation experiments demonstrate that the proposed framework achieves performance comparable to the state-of-the-art actor-critic framework that operates without data governance and communication constraints.

Paper Structure

This paper contains 32 sections, 23 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: The cloud-edge environment for residential load scheduling. The DGs supply electricity to households. The HEMSs take as input public information from the DGs and private observations from local households, and generate control signals for local flexible appliances.
  • Figure 2: The framework of DADC in the cloud-edge environment. At the edge layer, the actor network and critic network of each HEMS yield an individual policy and a scalar value $v_{i,t}$, respectively, using only the local observation $o_{i,t}$. The individual action $a_{i,t}$ is then sampled according to the generated policy, and the scalar value $v_{i,t}$ is communicated to the cloud. At the cloud layer, a learnable feed-forward network maps the concatenation of $n$ scalar values to the global value estimation $v^{\mathrm{tot}}$.
  • Figure 3: (a) Individual actor network. This network takes as input the local observation $o_{i,t}$ and the hidden state $h^\pi_{i,t-1}$, and generates the probability distribution over the individual action space. (b) Individual critic network. This network takes as input the local observation $o_{i,t}$ and the hidden state $h^v_{i,t-1}$, and yields the individual value estimation. (c) Feed-forward network. This network takes as input the concatenation of $n$ scalars and outputs the global value estimation.
  • Figure 4: Training curves of DADC and other frameworks. The solid curves correspond to the mean and the shaded regions to the minimum and maximum episode rewards over all trials.
  • Figure 5: The value loss for critic networks. DADC achieves the lowest estimation bias for the global value function.
  • ...and 2 more figures
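
The data flow described in the captions of Figures 2 and 3(c) can be sketched in a few lines of plain Python. This is only an illustrative forward pass under stated assumptions, not the authors' implementation: the local critic here is a single `tanh` unit with no recurrent hidden state, the cloud network has one ReLU hidden layer, all weights are random, and the names `local_critic`, `cloud_aggregator`, `obs_dim`, and `hidden_dim` are hypothetical.

```python
import math
import random

def local_critic(observation, weights):
    """Hypothetical on-device critic: maps one local observation to one scalar.
    (The recurrent hidden state of the paper's critic is omitted here.)"""
    return math.tanh(sum(w * x for w, x in zip(weights, observation)))

def cloud_aggregator(local_values, w1, b1, w2, b2):
    """Hypothetical one-hidden-layer feed-forward net over n concatenated scalars."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, local_values)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

rng = random.Random(0)
n_households, obs_dim, hidden_dim = 4, 6, 8  # illustrative sizes only

# Edge layer: each household holds its own observation and critic weights.
observations = [[rng.uniform(-1, 1) for _ in range(obs_dim)]
                for _ in range(n_households)]
critic_weights = [[rng.uniform(-1, 1) for _ in range(obs_dim)]
                  for _ in range(n_households)]

# Each household computes v_i from its local observation only;
# this scalar is the only quantity sent to the cloud.
local_values = [local_critic(o, w) for o, w in zip(observations, critic_weights)]

# Cloud layer: a feed-forward network maps the n scalars to v_tot.
w1 = [[rng.uniform(-1, 1) for _ in range(n_households)] for _ in range(hidden_dim)]
b1 = [0.0] * hidden_dim
w2 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
b2 = 0.0
v_tot = cloud_aggregator(local_values, w1, b1, w2, b2)

print(len(local_values), v_tot)
```

In this sketch the cloud function never receives `observations`, only `local_values`, which mirrors the privacy argument: the input to the cloud network has one entry per household regardless of how large each local observation is.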