Safe and Sustainable Electric Bus Charging Scheduling with Constrained Hierarchical DRL
Jiaju Qi, Lei Lei, Thorsteinn Jonsson, Dusit Niyato
TL;DR
The paper tackles safe and economical charging of electric bus fleets under PV and price uncertainty by formulating the problem as a constrained Markov decision process with temporal abstraction (options). It introduces a novel DAC-MAPPO-Lagrangian algorithm that combines a centralized high-level PPO-Lagrangian policy for charger allocation with decentralized low-level MAPPO-Lagrangian policies for per-EB charging, trained under a CTDE framework. Empirical results with real-world PV and price data show that the proposed method approaches the performance of an oracle MILP solution while significantly reducing safety violations and improving convergence stability, especially as fleet size grows. The work demonstrates the practical viability of safe HDRL for uncertainty-aware transportation systems and provides a principled mechanism to balance cost and safety without manual penalty tuning.
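The claim of balancing cost and safety without manual penalty tuning can be made concrete with the generic Lagrangian relaxation of a constrained MDP. The form below is the standard textbook one, not the paper's own notation: the symbols $J_{\mathrm{cost}}$ (expected operating cost), $J_{C}$ (expected constraint cost, e.g. battery-depletion events), $d$ (safety threshold), $\lambda$ (multiplier), and $\eta$ (dual step size) are all assumptions introduced for illustration.

```latex
\begin{aligned}
&\min_{\pi}\; J_{\mathrm{cost}}(\pi) \quad \text{s.t.} \quad J_{C}(\pi) \le d, \\
&\min_{\pi}\,\max_{\lambda \ge 0}\; \mathcal{L}(\pi,\lambda)
   = J_{\mathrm{cost}}(\pi) + \lambda\,\bigl(J_{C}(\pi) - d\bigr), \\
&\lambda \leftarrow \max\bigl(0,\; \lambda + \eta\,(J_{C}(\pi) - d)\bigr).
\end{aligned}
```

Under this reading, the multiplier $\lambda$ plays the role of a penalty weight that is learned rather than hand-tuned: it grows while the constraint is violated and decays toward zero once the policy is safe.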
Abstract
The integration of Electric Buses (EBs) with renewable energy sources such as photovoltaic (PV) panels is a promising approach to sustainable, low-carbon public transportation. However, optimizing EB charging schedules to minimize operational costs while ensuring safe operation without battery depletion remains challenging, especially under real-world conditions, where uncertain PV generation, dynamic electricity prices, variable travel times, and limited charging infrastructure must all be accounted for. In this paper, we propose a safe Hierarchical Deep Reinforcement Learning (HDRL) framework for solving the EB Charging Scheduling Problem (EBCSP) under multiple sources of uncertainty. We formulate the problem as a Constrained Markov Decision Process (CMDP) with options to enable temporally abstract decision-making. We develop a novel HDRL algorithm, Double Actor-Critic Multi-Agent Proximal Policy Optimization Lagrangian (DAC-MAPPO-Lagrangian), which integrates Lagrangian relaxation into the Double Actor-Critic (DAC) framework. At the high level, we adopt a centralized PPO-Lagrangian algorithm to learn safe charger allocation policies. At the low level, we use MAPPO-Lagrangian to learn decentralized charging power decisions under the Centralized Training and Decentralized Execution (CTDE) paradigm. Extensive experiments with real-world data demonstrate that the proposed approach outperforms existing baselines in both cost minimization and safety compliance while converging quickly.
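As a rough illustration of what integrating Lagrangian relaxation into a PPO-style learner can look like, the sketch below shows the two pieces that typically change: a projected dual-ascent update for the multiplier and a combined advantage fed to the policy loss. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the learning rate, the safety limit, and the `(1 + lam)` normalization (a common implementation choice) are all hypothetical, and reward is taken here to mean negative operating cost.

```python
def update_multiplier(lam, avg_constraint_cost, limit, lr):
    """Projected dual ascent on the Lagrange multiplier.

    The multiplier rises while the average constraint cost exceeds the
    limit and is clipped at zero so it never rewards violations.
    """
    return max(0.0, lam + lr * (avg_constraint_cost - limit))


def lagrangian_advantage(adv_reward, adv_cost, lam):
    """Combine reward and constraint-cost advantages into one signal.

    Dividing by (1 + lam) keeps the scale bounded as lam grows; this
    normalization is an assumption, not something stated in the source.
    """
    return (adv_reward - lam * adv_cost) / (1.0 + lam)


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical per-iteration average constraint costs (illustrative only).
    for avg_cost in [0.5, 0.4, 0.2, 0.05]:
        lam = update_multiplier(lam, avg_cost, limit=0.1, lr=0.1)
        print(f"avg_cost={avg_cost:.2f} lam={lam:.3f}")
```

In a hierarchical setting like the one described, one plausible arrangement is to apply the same pattern at both levels, with the high-level allocation policy and each low-level charging policy using the combined advantage in their clipped PPO objectives. Whether the multipliers are shared or kept separate per level is a design choice this sketch does not settle.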
