Autonomous vehicles need social awareness to find optima in multi-agent reinforcement learning routing games
Anastasia Psarou, Łukasz Gorczyca, Dominik Gaweł, Rafał Kucharski
TL;DR
The paper addresses convergence and performance challenges that arise when autonomous vehicles learn routing with MARL under purely selfish objectives. It introduces a counterfactual intrinsic reward based on marginal travel time, aggregated into a marginal cost matrix, which induces social awareness while preserving the system's equilibria. Empirical results on the toy TRY network and the real-world Saint-Arnoult network show faster convergence to the system-optimal solution and improvements in both individual AV travel times and overall system travel times. The approach demonstrates that socially aware routing can yield tangible efficiency gains in future urban traffic with AVs, though it incurs additional computational cost for marginal-cost evaluations.
Abstract
Previous work has shown that when multiple selfish Autonomous Vehicles (AVs) are introduced to future cities and start learning optimal routing strategies using Multi-Agent Reinforcement Learning (MARL), they may destabilize traffic systems, because converging to the optimal solution would take them a long time, equivalent to years of real-world commuting. We demonstrate that moving beyond the selfish component in the reward substantially alleviates this issue. If each AV, apart from minimizing its own travel time, also aims to reduce its impact on the system, this benefits not only system-wide performance but also each individual player in this routing game. By introducing an intrinsic reward signal based on the marginal cost matrix, we significantly reduce training time and achieve convergence more reliably. Marginal cost quantifies the impact of each individual action (route choice) on the system (total travel time). Including it as one of the components of the reward can reduce the degree of non-stationarity by aligning agents' objectives. Notably, the proposed counterfactual formulation preserves the system's equilibria and avoids oscillations. Our experiments show that training MARL algorithms with our novel reward formulation enables the agents to converge to the optimal solution, whereas the baseline algorithms fail to do so. We show these effects in both a toy network and the real-world network of Saint-Arnoult. Encouragingly, our results indicate that social awareness (i.e., including marginal costs in routing decisions) improves both the system-wide and individual performance of future urban systems with AVs.
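To make the notion of marginal cost concrete, the following minimal Python sketch computes it counterfactually on a hypothetical two-route network: each agent's marginal cost is the total system travel time with that agent present minus the total without it. The linear travel-time function, the route parameters, the function names, and the weighted combination in `social_rewards` are illustrative assumptions, not the formulation or code used in the paper.

```python
from collections import Counter


def route_travel_time(free_flow, slope, flow):
    # Assumed linear congestion model (illustrative only):
    # travel time grows linearly with the number of vehicles on the route.
    return free_flow + slope * flow


def total_travel_time(choices, routes):
    # Sum of travel times experienced by all agents.
    # `choices` is a list of route ids, one per agent.
    # `routes` maps route id -> (free_flow, slope).
    counts = Counter(choices)
    return sum(n * route_travel_time(*routes[r], n) for r, n in counts.items())


def marginal_costs(choices, routes):
    # Counterfactual marginal cost of each agent: total travel time with
    # the agent present minus total travel time with the agent removed.
    base = total_travel_time(choices, routes)
    return [
        base - total_travel_time(choices[:i] + choices[i + 1:], routes)
        for i in range(len(choices))
    ]


def social_rewards(choices, routes, weight=1.0):
    # Hypothetical reward: negative own travel time plus a weighted penalty
    # for the delay the agent imposes on others (marginal cost minus own
    # time). The weighting scheme is an assumption of this sketch.
    counts = Counter(choices)
    own = [route_travel_time(*routes[r], counts[r]) for r in choices]
    marginal = marginal_costs(choices, routes)
    return [-(t + weight * (m - t)) for t, m in zip(own, marginal)]
```

For example, with `routes = {0: (5.0, 1.0), 1: (10.0, 0.5)}` and `choices = [0, 0, 0, 1]`, each agent on route 0 experiences a travel time of 8.0 but has a marginal cost of 10.0, because it also delays the two other vehicles on that route by 1.0 each. The lone agent on route 1 has a marginal cost equal to its own travel time of 10.5.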
