A Simple Framework Towards Vision-based Traffic Signal Control with Microscopic Simulation

Pan He, Quanyi Li, Xiaoyong Yuan, Bolei Zhou

TL;DR

This work introduces TrafficDojo, a modular, open-source framework that bridges microscopic SUMO traffic simulation with the 3D rendering capability of MetaDrive to enable vision-based traffic signal control (TSC). By providing flexible sensor configurations, multi-view bird's-eye-view (BEV) observations, and a suite of RL and traditional baselines, the authors benchmark end-to-end TSC policies learned directly from visual inputs, including policies that use frozen foundation models for feature extraction. Key findings show that BEV-based vision RL policies, especially when paired with discriminative features from foundation models (BEV*), can outperform feature-based approaches, converge faster, and reduce CO2 emissions in simulated intersections. The framework's open, extensible design and its combined SUMO/MetaDrive setup create opportunities for robust evaluation of vision-based TSC and may accelerate research across the reinforcement learning and transportation communities.

Abstract

Traffic signal control (TSC) is crucial for reducing traffic congestion, leading to smoother traffic flow, reduced idle time, and lower CO2 emissions. In this paper, we explore a computer vision approach to TSC that modulates on-road traffic flows through visual observation. Unlike traditional feature-based approaches, vision-based methods depend much less on heuristics and predefined features, which makes them promising for end-to-end learning and optimization of traffic signals. We therefore introduce TrafficDojo, a simple traffic simulation framework and benchmark for vision-based TSC that integrates the microscopic traffic flow provided by SUMO into the 3D driving simulator MetaDrive. The proposed framework offers a versatile traffic environment for in-depth analysis and comprehensive evaluation of traffic signal controllers across diverse traffic conditions and scenarios. We establish and compare baseline algorithms, including both traditional and reinforcement learning (RL) approaches. This work sheds light on the design and development of vision-based TSC approaches and opens up new research opportunities.

Paper Structure (22 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of TrafficDojo. TrafficDojo generates rich 3D visual scenarios from SUMO maps using the rendering engine of MetaDrive. At each time step, a synchronization mechanism coordinates the creation, updating, and removal of vehicles and pedestrians between SUMO and MetaDrive. TrafficDojo thus provides an interactive Gym environment tailored to traffic signal control, with the ability to capture visual data from sensors such as RGB cameras and LiDARs positioned at a traffic intersection. It also connects directly to common RL training platforms such as Stable-Baselines3 (Raffin et al., 2021) and RLlib (Liang et al., 2018), on which a wide range of RL algorithms can be evaluated.
  • Figure 2: (Left) The multi-view observation setup at an intersection. It involves cameras positioned at a height of 10 meters, each focusing on the respective approaching lanes. The camera height can be adjusted to provide images in different views. (Right) The coordination of traffic signals and participants between MetaDrive and SUMO is achieved through synchronization. The status of traffic lights for specific links is represented by green, yellow, and red lines. The left image is the rendered image from MetaDrive and the right image is the SUMO visualization.
  • Figure 3: The execution logic of TrafficDojo. Multiple managers are registered and executed in specific orders for managing traffic signal controllers, synchronizations, and maps.
  • Figure 4: Evaluation of the trained policies from various RL controllers in the two-way single intersection, with a duration of $3,600$ seconds. The curves are smoothed using a moving average with a window size of $50$. (Left) Evaluation of CO2 emissions. (Right) Evaluation of average delay.
  • Figure 5: Evaluation of the trained policies from model variants of DQN in the Cologne 1x1 scenario, with a duration of $3,600$ seconds and a start time at the $25,200$th second. The curves are smoothed using a moving average with a window size of $50$. (Left) Evaluation of the accumulated waiting time. (Right) Evaluation of average queue length.
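
The captions for Figures 4 and 5 state that the curves are smoothed with a moving average of window size 50, but they do not say whether the window is trailing or centered. The following is a minimal sketch assuming a trailing window, where the earliest points are averaged over however many samples are available so far. The function name `moving_average` and this handling of the first points are illustrative assumptions, not the authors' plotting code.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int = 50) -> List[float]:
    """Trailing moving average (assumed variant; the paper does not specify one).

    Each output point is the mean of the current sample and up to
    `window - 1` preceding samples, so the output has the same length
    as the input.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")

    recent = deque()      # samples currently inside the window
    running_sum = 0.0     # sum of the samples in `recent`
    smoothed: List[float] = []

    for value in values:
        recent.append(value)
        running_sum += value
        # Drop the oldest sample once the window is exceeded.
        if len(recent) > window:
            running_sum -= recent.popleft()
        smoothed.append(running_sum / len(recent))

    return smoothed


if __name__ == "__main__":
    # Hypothetical per-step metric values, for illustration only.
    metric = [float(step % 7) for step in range(200)]
    curve = moving_average(metric, window=50)
    print(len(curve), curve[-1])
```

A trailing window keeps the smoothed curve aligned with the original time axis, at the cost of a lag of roughly half the window. A centered window would remove the lag but needs samples from both sides of each point.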