Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making

Fan Zuo, Donglin Zhou, Jingqin Gao, Kaan Ozbay

TL;DR

The paper tackles scalable, data-driven citywide traffic monitoring using existing CCTV networks, addressing two obstacles: non-stationary PTZ cameras and massive video data volumes. It proposes an end-to-end framework that combines graph-based viewpoint normalization via homography clustering, a YOLOv11 detector fine-tuned for urban traffic, and a domain-aware LLM module for automated summaries. The framework is validated on NYC congestion pricing data comprising about 9 million images from roughly 1,000 cameras. Key contributions include the viewpoint normalization method, object-detection performance of mAP@0.5 = 0.788, and the Multimodal Density Tracker dashboard for spatiotemporal analysis; prompts with domain exemplars improve numerical fidelity and reduce hallucinations. The NYC case study shows a 9% drop in weekday passenger-car density inside the Congestion Relief Zone (CRZ), early truck-density reductions of up to 19.5%, and rising pedestrian and cyclist activity. These results illustrate policy-relevant insights and the feasibility of large-scale deployment on existing ITS infrastructure, with minimal human input, for real-time, data-driven transportation decision making.

Abstract

Accurate, scalable traffic monitoring is critical for real-time and long-term transportation management, particularly during disruptions such as natural disasters, large construction projects, or major policy changes like New York City's first-in-the-nation congestion pricing program. However, widespread sensor deployment remains limited due to high installation, maintenance, and data management costs. While traffic cameras offer a cost-effective alternative, existing video analytics struggle with dynamic camera viewpoints and massive data volumes from large camera networks. This study presents an end-to-end AI-based framework leveraging existing traffic camera infrastructure for high-resolution, longitudinal analysis at scale. A fine-tuned YOLOv11 model, trained on localized urban scenes, extracts multimodal traffic density and classification metrics in real time. To address inconsistencies from non-stationary pan-tilt-zoom cameras, we introduce a novel graph-based viewpoint normalization method. A domain-specific large language model was also integrated to process massive data from a 24/7 video stream and generate frequent, automated summaries of evolving traffic patterns, a task far exceeding manual capabilities. We validated the system using over 9 million images from roughly 1,000 traffic cameras during the early rollout of NYC congestion pricing in 2025. Results show a 9% decline in weekday passenger vehicle density within the Congestion Relief Zone, early truck volume reductions with signs of rebound, and consistent increases in pedestrian and cyclist activity at corridor and zonal scales. Experiments showed that example-based prompts improved the LLM's numerical accuracy and reduced hallucinations. These findings demonstrate the framework's potential as a practical, infrastructure-ready solution for large-scale, policy-relevant traffic monitoring with minimal human intervention.
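
The abstract names a graph-based viewpoint normalization method but does not spell out its steps. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below treats each frame from one PTZ camera as a graph node, links two frames when enough feature matches agree between them, and reads the connected components as distinct viewpoints. The function name `group_viewpoints`, the `match_counts` input (assumed to be precomputed, for example from feature matching between frame pairs), and the `min_matches` threshold are all hypothetical.

```python
from collections import defaultdict


def group_viewpoints(n_frames, match_counts, min_matches=30):
    """Group frames of one camera into viewpoints (illustrative sketch).

    n_frames:     number of frames, indexed 0..n_frames-1.
    match_counts: dict mapping (i, j) frame pairs to a count of consistent
                  feature matches (hypothetical, precomputed input).
    min_matches:  hypothetical threshold for linking two frames.
    """
    # Union-find structure: parent[x] points toward the group representative.
    parent = list(range(n_frames))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Add an edge (merge groups) for every sufficiently similar frame pair.
    for (i, j), count in match_counts.items():
        if count >= min_matches:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    # Collect frames by their group representative.
    groups = defaultdict(list)
    for frame in range(n_frames):
        groups[find(frame)].append(frame)
    return sorted(groups.values())


# Frames 0, 1 and 3 share a view; frames 2 and 4 match nothing strongly.
pairs = {(0, 1): 50, (1, 3): 40, (2, 4): 10, (0, 2): 5}
print(group_viewpoints(5, pairs))  # [[0, 1, 3], [2], [4]]
```

Connected components let a frame join a viewpoint through a chain of matches, which suits a camera that leaves a view and later returns to it. A real pipeline would also need to decide how to handle partially overlapping views and out-of-service frames.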

Paper Structure

This paper contains 23 sections, 5 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Overall architecture of the proposed methodology
  • Figure 2: Illustrative examples of PTZ camera viewpoint variations: Case 1 – camera switches to a different roadway approach and later returns; Case 2 – same approach but with altered tilt or zoom, creating partial overlap with the original view; Case 3 – camera temporarily out of service.
  • Figure 3: (a) Overview of the proposed viewpoint normalization framework. (b) Example SIFT matches (green lines).
  • Figure 4: LLM‑Augmented Traffic Summarization Workflow.
  • Figure 5: NYC traffic camera network and congestion pricing toll zone
  • ...and 7 more figures