VLMLight: Safety-Critical Traffic Signal Control via Vision-Language Meta-Control and Dual-Branch Reasoning Architecture

Maonan Wang, Yirong Chen, Aoyu Pang, Yuxin Cai, Chung Shue Chen, Yuheng Kan, Man-On Pun

TL;DR

VLMLight addresses safety-critical traffic signal control by fusing vision-language scene grounding with a safety-prioritized meta-controller that dynamically selects between a fast reinforcement learning policy and a deliberative LLM-based reasoning branch. It introduces the first image-based, multi-view intersection simulator and a modular agent architecture (Scene, ModeSelector, PhaseReasoning, Plan, Check) that provides interpretable, auditable decisions. Empirical results show up to a 65% reduction in emergency-vehicle waiting times with less than 1% degradation in routine traffic, and end-to-end decision latency of about 11.5 seconds, suggesting practical real-time feasibility. The work also contributes an open-source vision-based TSC simulator and demonstrates the value of combining perception-grounded planning with structured safety checks for scalable deployment in urban networks.

Abstract

Traffic signal control (TSC) is a core challenge in urban mobility, where real-time decisions must balance efficiency and safety. Existing methods - ranging from rule-based heuristics to reinforcement learning (RL) - often struggle to generalize to complex, dynamic, and safety-critical scenarios. We introduce VLMLight, a novel TSC framework that integrates vision-language meta-control with dual-branch reasoning. At the core of VLMLight is the first image-based traffic simulator that enables multi-view visual perception at intersections, allowing policies to reason over rich cues such as vehicle type, motion, and spatial density. A large language model (LLM) serves as a safety-prioritized meta-controller, selecting between a fast RL policy for routine traffic and a structured reasoning branch for critical cases. In the latter, multiple LLM agents collaborate to assess traffic phases, prioritize emergency vehicles, and verify rule compliance. Experiments show that VLMLight reduces waiting times for emergency vehicles by up to 65% over RL-only systems, while preserving real-time performance in standard conditions with less than 1% degradation. VLMLight offers a scalable, interpretable, and safety-aware solution for next-generation traffic signal control.
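The dual-branch routing described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the `PhaseSummary` type, the `has_emergency` flag (in VLMLight the LLM meta-controller infers this from the scene text), and the rule-based stand-ins for the LLM agents are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PhaseSummary:
    """Hypothetical phase-level scene summary, as a VLM might produce it."""
    phase_id: int
    text: str
    has_emergency: bool  # assumed flag; the paper's LLM infers this from text


def select_mode(phases: List[PhaseSummary]) -> str:
    """Stand-in for the meta-controller: pick the branch for this step."""
    return "reasoning" if any(p.has_emergency for p in phases) else "rl"


def reasoning_branch(phases: List[PhaseSummary], legal_phases: List[int]) -> int:
    """Stand-in for the agent team: propose a phase, then verify it."""
    # Phase reasoning and planning: prefer a phase serving an emergency vehicle.
    candidates = [p.phase_id for p in phases if p.has_emergency]
    proposal = candidates[0] if candidates else legal_phases[0]
    # Rule verification: fall back to a legal phase if the proposal is invalid.
    return proposal if proposal in legal_phases else legal_phases[0]


def control_step(
    phases: List[PhaseSummary],
    rl_policy: Callable[[List[PhaseSummary]], int],
    legal_phases: List[int],
) -> Tuple[str, int]:
    """Route one decision through the fast or the deliberative branch."""
    if select_mode(phases) == "rl":
        return "rl", rl_policy(phases)
    return "reasoning", reasoning_branch(phases, legal_phases)


routine = [PhaseSummary(0, "short queue", False),
           PhaseSummary(1, "moderate queue", False)]
critical = [PhaseSummary(0, "short queue", False),
            PhaseSummary(1, "fire truck waiting", True)]
```

With a placeholder policy such as `lambda phases: 0`, the routine scene stays on the RL branch, while the critical scene is routed to the reasoning branch and selects the phase serving the fire truck. The point of the design is that the slower deliberative path is only paid for when the scene calls for it.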

Paper Structure

This paper contains 48 sections, 11 equations, 15 figures, 8 tables, 1 algorithm.

Figures (15)

  • Figure 1: Illustration of a four-way intersection with four signal phases. The simulator supports multi-view visual inputs, including a bird’s-eye view (left) and directional views from each approach (right), enabling lane-level observation of vehicle movements. In this example, the North-facing camera captures a fire truck traversing the intersection, highlighting the simulator’s ability to support safety-critical reasoning through perceptually grounded traffic understanding.
  • Figure 2: VLMLight architecture. Multi-view intersection images are first parsed by a VLM agent for scene understanding, after which a safety-prioritized LLM meta-controller interprets the scene and selects either a fast RL policy (orange) for routine traffic flow or a deliberative, collaborative reasoning policy (blue) for safety-critical scenarios. In the reasoning branch, a team of LLM agents (Phase Reasoning, Signal Planning, and Rule Verification) sequentially assesses traffic phases, vehicle priority, and rule compliance to determine the final action $a_t$. Together, the two branches provide real-time control and robust handling of complex events.
  • Figure 3: Illustration of $\texttt{Agent}_{\text{Scene}}$ in VLMLight. Given three-view images from a T-junction (left), a VLM-based Scene Description Agent generates direction-level textual summaries ($T_1$, $T_2$, $T_3$), describing lane semantics, congestion, and special vehicle presence. These summaries are then aggregated into phase-level descriptions ($P$) based on predefined signal phase mappings.
  • Figure 4: Three real-world intersections, each shown with three image modalities: (a) Songdo (South Korea), (b) Yau Ma Tei (Hong Kong), and (c) Massy (France). For each site, the satellite view is on the left, SUMO simulation in the middle, and our simulator rendering on the right.
  • Figure 5: Example of zero-padding at the Yau Ma Tei intersection.
  • ...and 10 more figures
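
The aggregation step in the Figure 3 caption, where direction-level summaries ($T_1$, $T_2$, $T_3$) are grouped into phase-level descriptions ($P$), might look roughly like the sketch below. The phase mapping, the direction keys, and the bracketed text format are hypothetical; the source states only that the mappings are predefined.

```python
from typing import Dict, List

# Hypothetical mapping for a T-junction: phase id -> directions it serves.
PHASE_MAP: Dict[int, List[str]] = {
    0: ["T1", "T2"],
    1: ["T3"],
}


def aggregate_to_phases(
    direction_summaries: Dict[str, str],
    phase_map: Dict[int, List[str]],
) -> Dict[int, str]:
    """Concatenate the summaries of all directions served by each phase."""
    return {
        phase: " ".join(f"[{d}] {direction_summaries[d]}" for d in directions)
        for phase, directions in phase_map.items()
    }


summaries = {
    "T1": "light traffic",
    "T2": "fire truck approaching",
    "T3": "long queue",
}
phase_descriptions = aggregate_to_phases(summaries, PHASE_MAP)
```

Keeping the mapping as plain data means the same function would work for a four-way intersection by swapping in a different `PHASE_MAP`.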