Enhancing Traffic Signal Control through Model-based Reinforcement Learning and Policy Reuse

Yihong Li, Chengwei Zhang, Furui Zhan, Wanting Liu, Kailing Zhou, Longji Zheng

TL;DR

The paper formulates traffic signal control as a multi-agent partially observable Markov game $M= \langle \mathcal{N}, \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and addresses generalization and data efficiency in MARL-based ATSC. The authors propose PLight, a model-based pretraining framework that learns an environmental state-transition model via an Encoder-Decoder-Q architecture, and PRLight, a policy-reuse transfer mechanism that selects similarity-weighted guide agents from an agent pool to accelerate learning in a target domain, where the target policy maximizes $Q^{\pi}_{\mathcal{M}_{tar}}(s,a)$. Key contributions include the two-stage PLight/PRLight framework, an encoder-decoder environmental model that predicts next observations $\hat{o}'_i$, and a similarity-based policy reuse strategy that reduces exploration costs and improves convergence stability across within-network and cross-network transfers. Empirical results on CityFlow, including New York, Jinan, and Hangzhou TOD scenarios, demonstrate faster adaptation and robust generalization, with the decoder component showing limited impact on final policy performance.

Abstract

Multi-agent reinforcement learning (MARL) has shown significant potential in traffic signal control (TSC). However, current MARL-based methods often suffer from insufficient generalization due to the fixed traffic patterns and road network conditions used during training. This limitation results in poor adaptability to new traffic scenarios, leading to high retraining costs and complex deployment. To address this challenge, we propose two algorithms: PLight and PRLight. PLight employs a model-based reinforcement learning approach, pretraining control policies and environment models using predefined source-domain traffic scenarios. The environment model predicts the state transitions, which facilitates the comparison of environmental features. PRLight further enhances adaptability by adaptively selecting pre-trained PLight agents based on the similarity between the source and target domains to accelerate the learning process in the target domain. We evaluated the algorithms through two transfer settings: (1) adaptability to different traffic scenarios within the same road network, and (2) generalization across different road networks. The results show that PRLight significantly reduces the adaptation time compared to learning from scratch in new TSC scenarios, achieving optimal performance using similarities between available and target scenarios.
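The similarity-based selection described above can be illustrated with a small sketch. The text here does not give the similarity formula, so everything below is an assumption for illustration: each pre-trained source agent is represented only by a hypothetical environment model `EnvModel` that predicts the next observation, similarity is taken as a softmax over the negative mean squared prediction error on target-domain transitions, and a guide agent is sampled in proportion to those weights. The names `prediction_error`, `similarity_weights`, `select_guide`, and the `temperature` parameter are illustrative and not taken from the paper.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Observation = Sequence[float]
# Hypothetical environment model: (observation, action) -> predicted next observation.
EnvModel = Callable[[Observation, int], Observation]
# One target-domain transition: (observation, action, next observation).
Transition = Tuple[Observation, int, Observation]


def prediction_error(model: EnvModel, transitions: List[Transition]) -> float:
    """Mean squared error between predicted and observed next observations."""
    total = 0.0
    count = 0
    for obs, action, next_obs in transitions:
        predicted = model(obs, action)
        total += sum((p - t) ** 2 for p, t in zip(predicted, next_obs))
        count += len(next_obs)
    return total / count


def similarity_weights(errors: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over negative errors (assumed form): lower error gives higher weight."""
    scores = [-e / temperature for e in errors]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    norm = sum(exps)
    return [x / norm for x in exps]


def select_guide(weights: List[float], rng: random.Random) -> int:
    """Sample the index of a guide agent in proportion to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy target dynamics: every observation entry increases by the action value.
    target_data: List[Transition] = [
        ([0.0, 1.0], 1, [1.0, 2.0]),
        ([2.0, 2.0], 0, [2.0, 2.0]),
    ]
    # Two toy source models: one matches the target dynamics, one ignores the action.
    pool: List[EnvModel] = [
        lambda o, a: [x + a for x in o],
        lambda o, a: list(o),
    ]
    errors = [prediction_error(m, target_data) for m in pool]
    weights = similarity_weights(errors)
    guide = select_guide(weights, random.Random(0))
    print("errors:", errors, "weights:", weights, "guide index:", guide)
```

In this toy setup the source model whose predictions match the target transitions receives the larger weight, so it is sampled as the guide more often. The `temperature` parameter controls how sharply the weights concentrate on the closest source domain.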

Paper Structure

This paper contains 21 sections, 15 equations, 8 figures, 3 tables, 2 algorithms.

Figures (8)

  • Figure 1: Four Traffic Signal Phases.
  • Figure 2: Overall Architecture of the methods. The methods are divided into two stages: source domain training and target domain transfer. The left side (PLight: pre-training) shows the agent model’s network structure and training process for source domain tasks (illustrated using domain $k$; the process is identical for other domains), and the trained models are stored in an agent pool. The right side (PRLight: transfer) shows the transfer and training process in the target domain task. Each block in the agent pool denotes an agent, where E stands for the Encoder structure, D for the Decoder structure, and Q for the Q-network.
  • Figure 3: PLight Network Architecture with Depicted Inputs and Outputs
  • Figure 4: Guide Agent Selection: Single-Step Similarity Computation and Guide Agent Acquisition via Weights
  • Figure 5: PCA Analysis of Traffic Flow Data: 3 Datasets from Jinan and 4 Datasets from Hangzhou
  • ...and 3 more figures