Using a single actor to output personalized policy for different intersections

Kailing Zhou, Chengwei Zhang, Furui Zhan, Wanting Liu, Yihong Li

TL;DR

This paper tackles the challenge of personalizing traffic signal policies across non-identically distributed intersections while preserving efficient parameter sharing. It introduces HAMH-PPO, a Centralized Training with Decentralized Execution MARL framework that employs a Graph Attention-based MH-Critic to output multiple value heads and a hyper-action mechanism in the HA-Actor to weight these heads, yielding intersection-specific guidance without proliferating parameters. The approach demonstrates superior performance over traditional and MARL baselines across six CityFlow scenarios, particularly excelling in large, irregular networks, and shows that hyper-action weights adapt dynamically to traffic conditions, enabling true personalization. The work contributes a practical, scalable method for large-scale adaptive traffic signal control with personalized policies, and releases the source code for reproducibility.

Abstract

Recently, with the development of multi-agent reinforcement learning (MARL), adaptive traffic signal control (ATSC) has achieved satisfactory results. In traffic scenarios with multiple intersections, MARL treats each intersection as an agent and optimizes traffic signal control strategies through learning and real-time decision-making. Because the observation distributions of intersections may differ in real-world scenarios, parameter-sharing methods may lack diversity and thus place high generalization demands on the shared policy network. A typical solution is to increase the number of network parameters. However, simply scaling up the network does not necessarily improve policy generalization, as our experiments confirm. Accordingly, an approach that considers both the personalization of intersections and the efficiency of parameter sharing is required. To this end, we propose Hyper-Action Multi-Head Proximal Policy Optimization (HAMH-PPO), a Centralized Training with Decentralized Execution (CTDE) MARL method that uses a shared PPO policy network to deliver personalized policies for intersections with non-iid observation distributions. The centralized critic in HAMH-PPO uses graph attention units to compute graph representations of all intersections and, through multiple output heads, produces a set of value estimates for each intersection. The decentralized execution actor takes the local observation history as input and outputs a distribution over actions as well as a so-called hyper-action, which balances the multiple values estimated by the centralized critic to further guide the updating of TSC policies. The combination of hyper-action and multi-head values enables multiple agents to share a single actor-critic while achieving personalized policies.
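
As a minimal sketch of how a hyper-action could balance multiple value heads, the Python snippet below combines $k$ per-intersection value estimates through a dot product with a weight vector. The function name `joint_value`, the softmax normalization of the weights, and the toy numbers are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_value(head_values, hyper_logits):
    """Combine multi-head values with per-agent hyper-action weights.

    head_values:  (N, k) array, k value estimates for each of N intersections.
    hyper_logits: (N, k) array, unnormalized hyper-action outputs (assumed form).
    Returns the (N,) joint values and the (N, k) normalized weights.
    """
    weights = softmax(hyper_logits, axis=-1)          # assumption: weights sum to 1
    values = np.sum(weights * head_values, axis=-1)   # row-wise dot product
    return values, weights

# Two intersections share the same critic heads but weight them differently.
head_values = np.array([[1.0, 2.0, 3.0],
                        [1.0, 2.0, 3.0]])
hyper_logits = np.array([[0.0, 0.0, 0.0],     # uniform weighting
                         [10.0, 0.0, 0.0]])   # strongly prefers the first head
values, weights = joint_value(head_values, hyper_logits)
print(values)
```

In this toy case both intersections see identical head outputs, yet the resulting joint values differ because each intersection supplies its own weights. That is the sense in which a shared actor-critic can still yield intersection-specific guidance.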

Paper Structure

This paper contains 20 sections, 9 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: The traffic flow distribution and the observation distribution of a $1 \times 3$ network, together with the learning curves of parameter-sharing and non-parameter-sharing methods (plotted from 5 testing runs, with smoothing applied every 9 episodes). The comparison suggests that the personalization of intersections needs to be considered.
  • Figure 2: Signal phase and corresponding action set of crossroads.
  • Figure 3: (a) The overall structure of HAMH-PPO, which is designed on top of the PPO algorithm. It consists of $N$ agents, each corresponding to an intersection, and a centralized critic. (b) The structure of a single actor: local observations pass through feature extraction to produce the action $a_i^t$ that affects the environment, while cyclic features combined with the intersection index $i$ produce the hyper-action $w_i^t$. (c) The structure of the critic: global observations and graph information are processed with GAT to generate a set of value functions for each intersection. The joint value function, obtained as the dot product of these values with the hyper-action, estimates the average travel time at the current intersection.
  • Figure 4: The detailed structure of the cubic network, which ultimately outputs $k$-dimensional value estimates.
  • Figure 5: Performance comparison of RL methods (PNC-HDQN, MA2C, Colight, MPLight, and HAMH-PPO) on six datasets.
  • ...and 3 more figures
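
The critic described in the Figure 3 caption processes global observations and graph information with GAT and emits $k$ values per intersection. The snippet below is a rough sketch of that idea using the standard single-head graph attention formulation. The layer sizes, random weights, the linear value projection `W_v`, and the function name `gat_layer` are illustrative assumptions, and the paper's actual critic may differ.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, adj, W, a):
    """Single-head graph attention over intersections.

    h:   (N, F) node features, one row per intersection.
    adj: (N, N) binary adjacency with self-loops.
    W:   (F, D) shared linear projection.
    a:   (2D,) attention vector applied to [z_i || z_j].
    Returns (N, D) aggregated features and (N, N) attention weights.
    """
    z = h @ W
    d = z.shape[1]
    # a^T [z_i || z_j] splits into a source term plus a destination term.
    scores = leaky_relu((z @ a[:d])[:, None] + (z @ a[d:])[None, :])
    scores = np.where(adj > 0, scores, -np.inf)   # mask non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha

rng = np.random.default_rng(0)
N, F, D, k = 3, 4, 8, 5                # hypothetical sizes
adj = np.array([[1, 1, 0],             # a 1 x 3 line of intersections
                [1, 1, 1],
                [0, 1, 1]])
h = rng.normal(size=(N, F))
W = rng.normal(size=(F, D))
a = rng.normal(size=(2 * D,))
W_v = rng.normal(size=(D, k))          # assumed linear multi-head value output

graph_repr, alpha = gat_layer(h, adj, W, a)
head_values = graph_repr @ W_v         # (N, k): k value estimates per intersection
print(head_values.shape)
```

Each row of `head_values` could then be combined with that intersection's hyper-action $w_i^t$ by a dot product, as the caption describes.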