Autonomous vehicles need social awareness to find optima in multi-agent reinforcement learning routing games

Anastasia Psarou, Łukasz Gorczyca, Dominik Gaweł, Rafał Kucharski

TL;DR

The paper addresses convergence and performance challenges that arise when autonomous vehicles learn routing with MARL under selfish objectives. It introduces a counterfactual intrinsic reward based on marginal travel time, aggregated into a marginal cost matrix, to induce social awareness while preserving the system's equilibria. Empirical results on the toy TRY network and the real-world Saint-Arnoult network show faster convergence to the system-optimal solution and improvements in both individual AV travel times and overall system travel times. The approach demonstrates that socially aware routing can yield tangible efficiency gains in future urban traffic with AVs, though it incurs additional computational cost for marginal-cost evaluations.

Abstract

Previous work has shown that when multiple selfish Autonomous Vehicles (AVs) are introduced to future cities and start learning optimal routing strategies using Multi-Agent Reinforcement Learning (MARL), they may destabilize traffic systems, as they would require a significant amount of time to converge to the optimal solution, equivalent to years of real-world commuting. We demonstrate that moving beyond the selfish component in the reward significantly relieves this issue. If each AV, apart from minimizing its own travel time, aims to reduce its impact on the system, this will be beneficial not only for the system-wide performance but also for each individual player in this routing game. By introducing an intrinsic reward signal based on the marginal cost matrix, we significantly reduce training time and achieve convergence more reliably. Marginal cost quantifies the impact of each individual action (route-choice) on the system (total travel time). Including it as one of the components of the reward can reduce the degree of non-stationarity by aligning agents' objectives. Notably, the proposed counterfactual formulation preserves the system's equilibria and avoids oscillations. Our experiments show that training MARL algorithms with our novel reward formulation enables the agents to converge to the optimal solution, whereas the baseline algorithms fail to do so. We show these effects in both a toy network and the real-world network of Saint-Arnoult. Our results optimistically indicate that social awareness (i.e., including marginal costs in routing decisions) improves both the system-wide and individual performance of future urban systems with AVs.
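
To make the idea of a counterfactual marginal cost concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical two-route network with linear congestion (the `ROUTES` parameters are invented for illustration) and assumes that the shaped reward combines the agent's own travel time and its marginal cost through weights `alpha` and `beta`; the exact formulation used in the paper is not reproduced here. The marginal cost of an agent is computed as the total travel time of all other agents with the agent present, minus their total travel time when the agent is removed.

```python
from typing import List

# Hypothetical two-route network: (free-flow time, delay added per vehicle).
# These numbers are illustrative assumptions, not values from the paper.
ROUTES = [(5.0, 1.0), (7.0, 0.5)]


def travel_times(choices: List[int]) -> List[float]:
    """Travel time of every agent, given each agent's chosen route index."""
    flows = [choices.count(r) for r in range(len(ROUTES))]
    times = []
    for route in choices:
        free_flow, slope = ROUTES[route]
        times.append(free_flow + slope * flows[route])
    return times


def marginal_cost(choices: List[int], agent: int) -> float:
    """Counterfactual impact of `agent` on the other agents' total travel time."""
    with_agent = travel_times(choices)
    others_with = sum(t for i, t in enumerate(with_agent) if i != agent)
    # Counterfactual scenario: same choices, but the agent is removed.
    without_agent = choices[:agent] + choices[agent + 1:]
    others_without = sum(travel_times(without_agent))
    return others_with - others_without


def shaped_reward(choices: List[int], agent: int,
                  alpha: float = 1.0, beta: float = 0.5) -> float:
    """Assumed reward form: negative weighted sum of own time and marginal cost."""
    own_time = travel_times(choices)[agent]
    return -(alpha * own_time + beta * marginal_cost(choices, agent))
```

With `choices = [0, 0, 1]`, agent 0 shares Route 0 with one other vehicle and delays it by one time unit, so its marginal cost is 1.0, while agent 2 is alone on Route 1 and has a marginal cost of 0.0. Setting `beta = 0` recovers the purely selfish reward in this sketch.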

Paper Structure

This paper contains 14 sections, 11 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview: When multiple selfish AVs are introduced into future cities and must simultaneously learn optimal routing strategies using MARL, they need many training iterations to converge to the optimal solution (a). However, incorporating the impact that an AV's presence has on the other agents of the system into its travel-time-based reward can improve the convergence of MARL algorithms applied to the route choice problem of AVs (b). This enables faster convergence to the system optimum and can also be more beneficial for individual AV agents. We show this with experiments using RouteRL (akman2025routerlmultiagentreinforcementlearning), a Multi-Agent Reinforcement Learning (MARL) framework that models the routing decisions of AVs and human drivers (c), and demonstrate our results using the Two-Route (Yield) network (TRY) (d), where agents choose between Route 0, which is the shortest route without priority, and Route 1, a slightly longer alternative with priority.
  • Figure 2: Snapshot from SUMO of the TRY network. Red vehicles represent human agents and yellow ones AVs.
  • Figure 3: Introducing the marginal travel time in the reward of the AV agents accelerates convergence of the AVs to the optimal solution. We demonstrate this by incorporating two types of marginal travel times (MTTs): one that considers the impact on AV agents (AV group marginal), and a second that considers the impact on all drivers (system marginal) (eq. \ref{eq:intrinsic_reward_part2}). We also demonstrate that our method enhances convergence in non-deterministic traffic dynamics, which more accurately represent real-world traffic conditions. When the proportion of AVs choosing the optimal action is close to 1, it indicates that nearly all agents selected the optimal solution. A proportion of 0.5 indicates that half of the agents chose the optimal option. The last 100 iterations of the plots depict the evaluation mode, where agents use the learned policy without exploration.
  • Figure 4: Effect of different values of the shaping coefficient $\beta$ on the convergence of the UCB (a), MAPPO (b), and IDQN (c) algorithms. Higher values of $\beta$ lead to faster convergence to the optimal solution. When the proportion of agents choosing the system optimal action is close to 1, it indicates that nearly all agents selected the optimal solution.
  • Figure 5: Number of equilibria in the system for different $\alpha$, $\beta$ coefficients. When $\beta > 0$ and $\alpha = 1$, the system has one unique equilibrium, the same as the equilibrium when the AV agents are selfish ($\beta = 0$ and $\alpha = 1$). This suggests that with $\beta$ values above a certain threshold and $\alpha = 1$, the equilibrium state remains unchanged, as discussed in section \ref{sec:equilibria}. Additionally, for hypothetical values of $\beta < 0$ (malicious behavior; jamroz2025social), multiple equilibria become possible in the system.