Characterizing MARL for Energy Control: A Multi-KPI Benchmark on the CityLearn Environment

Aymen Khouja, Imen Jendoubi, Oumayma Mahjoub, Oussama Mahfoudhi, Claude Formanek, Siddarth Singh, Ruan De Kock

TL;DR

Results show that DTDE consistently outperforms CTDE in both average and worst-case performance, and that temporal dependency learning improves control on memory-dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation.

Abstract

The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. To address scalability and coordination concerns, Multi-Agent Reinforcement Learning (MARL) is a promising solution. This paper addresses the need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. Using this environment, our work sets a new standard for evaluation by conducting a comparative study across multiple key performance indicators (KPIs). This approach illuminates the key strengths and weaknesses of various algorithms, moving beyond traditional KPI averaging, which often masks critical insights. Our experiments utilize widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), and encompass diverse training schemes, including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE) approaches, as well as different neural network architectures. Our work also proposes novel KPIs that tackle real-world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst-case performance. Additionally, temporal dependency learning improves control on memory-dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.

Paper Structure (18 sections, 4 equations, 16 figures, 9 tables, 1 algorithm)

Figures (16)

  • Figure 1: An illustrative overview of our comprehensive MARL evaluation framework. (a) A multi-objective comparison showing algorithm performance trade-offs across nine distinct KPIs. (b) A sample analysis quantifying granular metrics, such as the relative contribution of individual agents. (c) A probabilistic rank distribution from our statistical robustness analysis, summarizing algorithm performance over multiple trials. (d) Pairwise improvement probabilities indicating how likely one algorithm is to outperform another.
  • Figure 2: Sample efficiency curves for algorithms on the average score metric, showing SAC’s strong early gains but weaker plateaued performance, while IPPO ultimately achieves the lowest average score and best overall results.
  • Figure 3: Aggregate Performance Analysis of Control Algorithms. This figure presents a consolidated view of algorithm performance across all evaluation scenarios. (a) The IQM of the aggregate score, where lower is better. IPPO achieves the best average performance. (b) The CVaR of the score highlights worst-case outcomes, again showing IPPO's superiority and indicating robust performance. The wide confidence intervals for MAPPO in both (a) and (b) reveal its high sensitivity to random seeds.
  • Figure 4: Comparison of Key Performance Indicators (KPIs) Across Algorithms. Performance on three environmental metrics, where lower IQM scores (a–c) are better. (a) Ramping (IQM): Recurrent models significantly outperform their feedforward counterparts on this metric. (b) Carbon Emissions (IQM): All algorithms perform similarly, with no clear, consistent advantage from temporal dependency. (c) Discomfort Proportion (IQM): Recurrent-SAC and IPPO achieve the strongest performance, highlighting decentralized variants' potential on this metric. (d–f) Probability of Improvement: These plots mirror the trends in (a–c), showing that recurrent models outperform feedforward variants on ramping (d) but offer no clear advantage on carbon emissions (e) or discomfort reduction (f). Notably, beyond Recurrent-IPPO, the decentralized variants also exhibit a high probability of outperforming other approaches on the discomfort metric.
  • Figure 5: Battery Usage Patterns for Electrical and Hot Water Storage. This figure compares the battery management strategies learned by the algorithms, focusing on Depth of Discharge (DoD) and average discharge duration. Lower DoD IQM scores indicate less strain on the batteries, while higher duration IQM scores indicate longer discharges, which are more beneficial. (a, c) The DoD for both electrical and hot water storage is consistently lower for recurrent independent learners (Rec-IPPO and Rec-SAC), indicating shallower and less stressful discharge cycles. (b, d) Similarly, these recurrent agents achieve longer average discharge durations for electrical storage, reflecting more frequent and responsive battery usage.
  • ...and 11 more figures
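
The captions above rely on three aggregate statistics: the interquartile mean (IQM), the conditional value at risk (CVaR), and the probability of improvement. The following is a minimal sketch of how such statistics are commonly computed from per-seed scores, not the paper's actual evaluation code. The function names, the tail fraction `alpha=0.25`, and the sample scores are illustrative assumptions. Since the captions state that lower scores are better, the sketch treats the highest scores as the worst case.

```python
from math import ceil


def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    k = int(len(ordered) * 0.25)  # number of runs trimmed from each end
    middle = ordered[k:len(ordered) - k]
    return sum(middle) / len(middle)


def cvar_worst(scores, alpha=0.25):
    """Mean of the worst alpha-fraction of runs.

    Assumption: lower scores are better, so the worst runs are the highest.
    """
    ordered = sorted(scores, reverse=True)
    k = max(1, ceil(alpha * len(ordered)))  # keep at least one run in the tail
    return sum(ordered[:k]) / k


def prob_improvement(scores_x, scores_y):
    """Fraction of run pairs in which X beats Y (lower score); ties count half."""
    wins = 0.0
    for x in scores_x:
        for y in scores_y:
            if x < y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(scores_x) * len(scores_y))


if __name__ == "__main__":
    # Hypothetical per-seed aggregate scores, for illustration only.
    algo_a = [0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97]
    algo_b = [0.92, 0.93, 0.95, 0.96, 0.98, 1.00, 1.05, 1.20]
    print("IQM A:", iqm(algo_a), "IQM B:", iqm(algo_b))
    print("CVaR A:", cvar_worst(algo_a), "CVaR B:", cvar_worst(algo_b))
    print("P(A improves on B):", prob_improvement(algo_a, algo_b))
```

Because the IQM discards the top and bottom quarter of runs, it is less sensitive to outlier seeds than a plain mean, while the CVaR focuses only on the tail. Reporting both gives the average-case and worst-case views described in Figure 3.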