ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones

Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, Srinivasa G. Narasimhan

TL;DR

ROADWork introduces the first large-scale, multimodal dataset and benchmark focused on perception and navigation in road-work zones. By combining richly annotated images, videos, scene descriptions, and 2D/3D pathways across diverse geographies, it reveals that open-vocabulary foundation models underperform on work-zone objects and signs, while targeted fine-tuning yields substantial gains in detection, sign reading, and pathway reasoning. The work demonstrates practical improvements via simple techniques (video label propagation, crop-rescale, and object-context augmentation) and shows that ROADWork enables robust long-horizon navigation through work zones, with concrete gains in angular error and path prediction reliability. Overall, ROADWork provides a critical resource for advancing perception and planning in a challenging, underrepresented driving scenario and supports geographic-domain adaptation studies.

Abstract

Perceiving and autonomously navigating through work zones is a challenging and underexplored problem. Open datasets for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models fail when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With the ROADWork dataset, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8$\times$) around the world. Open-vocabulary methods fail too, whereas fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques. Video label propagation provides additional gains (+2.6 AP) for instance segmentation. While reading work zone signs, composing a detector and text spotter via crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We predict navigational goals and compute drivable paths from work zone videos. Incorporating road work semantics ensures 53.6% of goals have angular error (AE) < 0.5 (+9.9%) and 75.3% of pathways have AE < 0.5 (+8.1%).
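
The sign-reading result above is reported in 1-NED. As a minimal sketch, assuming the conventional definition from scene-text evaluation (one minus the Levenshtein distance normalized by the longer string's length), the score for a single sign could be computed as below. The abstract does not spell out the exact normalization, so the function names `edit_distance` and `one_minus_ned` and the handling of empty strings are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def one_minus_ned(pred: str, truth: str) -> float:
    """1 - normalized edit distance; 1.0 means the strings are identical.

    Assumption: normalize by the longer of the two strings and treat two
    empty strings as a perfect match.
    """
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / longest


if __name__ == "__main__":
    # One misread character ("0" for "O") in a nine-character sign text.
    print(one_minus_ned("ROAD W0RK", "ROAD WORK"))
```

Under this definition a higher score is better, and a dataset-level number would typically be the mean over all annotated signs.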

Paper Structure (18 sections, 1 equation, 18 figures, 23 tables)

Figures (18)

  • Figure 1: Autonomous Driving In Work Zones. Work zone objects block the road (in red), making some road regions unsafe (red arrows), necessitating the development of forecasting traversable paths from driven paths (green dots). Prior datasets do not comprehensively address this long-tailed challenge. Foundation models struggle with recognizing and interpreting work zone objects and signs, discovering new work zones, and analyzing work zones. We additionally formulate work zone navigation as a learnable task. The ROADWork dataset and benchmark highlight key challenges in work zone perception and navigation, and we demonstrate simple techniques that improve performance on these tasks.
  • Figure 2: The ROADWork Dataset consists of work zone videos and images. We have segmented 15 object categories such as workers, vehicles and barriers. We provide object attributes for signs and arrow boards to enable fine-grained understanding. Our work zone scene descriptions analyze the scene globally, and one passable trajectory, automatically estimated from the associated video sequence, supports learning how to drive through work zones. See Appendix \ref{sup:dataset} for more details.
  • Figure 3: Generating Driving Trajectory. Using driving video frames (a-c) as input, we estimated camera poses using COLMAP \cite{schoenberger2016sfm} (d-f), where camera poses are depicted as viewing frustums shown in red. These poses were projected onto the estimated ground plane to form the driven trajectory, then back-projected onto the frames as the future driving trajectory (g).
  • Figure 4: Challenges And Variations In Work Zone Objects. Work zone objects and work vehicles in the ROADWork dataset vary significantly across geographies. For instance, the appearance of barriers, vertical panels, and tubular markers (first row) varies across cities. Similarly, work vehicles (second row) show large variations, as they are specialized for specific tasks within work zones.
  • Figure 5: Work Zones Discovered Around The World. The Mapillary dataset \cite{neuhold2017mapillary} contains driving images from around the world. Although the ROADWork training images are restricted to the U.S., we discovered work zones captured in Europe and Asia.
  • ...and 13 more figures
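
The trajectory generation described in the Figure 3 caption (camera centers projected onto a ground plane, then back-projected into a frame) can be sketched roughly as follows. This is only an illustration under stated assumptions: a pinhole camera with a single focal length, a ground plane supplied as `normal · x + offset = 0`, and a world-to-camera pose given as a rotation matrix and translation. The helper names `project_onto_plane` and `to_pixel` are hypothetical and are not taken from the paper or from COLMAP.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_onto_plane(point: Vec3, normal: Vec3, offset: float) -> Vec3:
    """Orthogonally project a 3D point onto the plane normal . x + offset = 0."""
    signed = (dot(normal, point) + offset) / dot(normal, normal)
    return tuple(p - signed * n for p, n in zip(point, normal))


def to_pixel(point_world: Vec3,
             rotation: Sequence[Vec3],
             translation: Vec3,
             focal: float,
             principal: Tuple[float, float]) -> Optional[Tuple[float, float]]:
    """Pinhole projection: x_cam = R x_world + t, followed by perspective divide."""
    cam = tuple(dot(row, point_world) + t for row, t in zip(rotation, translation))
    if cam[2] <= 0:
        return None  # the point lies behind the camera and is not visible
    u = focal * cam[0] / cam[2] + principal[0]
    v = focal * cam[1] / cam[2] + principal[1]
    return (u, v)


if __name__ == "__main__":
    # Assumed toy setup: current camera at the origin with y pointing down,
    # ground plane 1.5 units below the camera (y = 1.5).
    identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    ground_normal, ground_offset = (0.0, 1.0, 0.0), -1.5
    future_centers = [(0.0, 0.0, 5.0), (0.5, 0.0, 10.0)]  # later camera centers

    for center in future_centers:
        on_ground = project_onto_plane(center, ground_normal, ground_offset)
        print(to_pixel(on_ground, identity, (0.0, 0.0, 0.0), 1000.0, (640.0, 360.0)))
```

In this toy setup the farther trajectory point lands higher in the image (smaller `v`), which matches the way a driven path recedes toward the horizon when overlaid on a frame.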