WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding

Quan Kong, Yuki Kawana, Rajat Saini, Ashutosh Kumar, Jingjing Pan, Ta Gu, Yohei Ozao, Balazs Opra, David C. Anastasiu, Yoichi Sato, Norimasa Kobori

TL;DR

The paper introduces WTS, a pedestrian-centric traffic video dataset for fine-grained spatiotemporal understanding, comprising multi-view video, 3D gaze data, and rich textual captions across 255 scenarios. It also proposes an LLM-based evaluation metric (LLMScorer) and an instance-aware VideoLLM baseline for more precise, instance-level video captioning in traffic contexts. The experiments indicate that prompt design strongly affects traditional metrics, that fine-tuning and instance conditioning can improve performance, and that LLMScorer aligns closely with human judgments. Together, the dataset and methodology support research on pedestrian-vehicle interaction, gaze-informed analysis, and safety-oriented autonomous driving.

Abstract

In this paper, we address the challenge of fine-grained video event understanding in traffic scenarios, vital for autonomous driving and safety. Traditional datasets focus on driver or vehicle behavior, often neglecting pedestrian perspectives. To fill this gap, we introduce the WTS dataset, highlighting detailed behaviors of both vehicles and pedestrians across over 1.2k video events in hundreds of traffic scenarios. WTS integrates diverse perspectives from vehicle ego and fixed overhead cameras in a vehicle-infrastructure cooperative environment, enriched with comprehensive textual descriptions and unique 3D gaze data for a synchronized 2D/3D view, focusing on pedestrian analysis. We also provide annotations for 5k publicly sourced pedestrian-related traffic videos. Additionally, we introduce LLMScorer, an LLM-based evaluation metric to align inference captions with ground truth. Using WTS, we establish a benchmark for dense video-to-text tasks, exploring state-of-the-art Vision-Language Models with an instance-aware VideoLLM method as a baseline. WTS aims to advance fine-grained video event understanding, enhancing traffic safety and autonomous driving development.

Paper Structure

This paper contains 15 sections, 2 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Overview of the WTS dataset features. We provide multi-view videos with fine-grained captions focusing on pedestrian behavior, along with 3D gaze and location information for a more detailed understanding of traffic-related videos.
  • Figure 2: A full caption example with its structure design for the [action] phase.
  • Figure 3: Overview of the WTS video caption data structure. (1) The left figure shows multiple views, from overhead to ego-vehicle, across 5 phases. (2) The right figure shows the definition of our phase segments and, as an example, the ground-truth (GT) captions for the action segment, describing the target pedestrian and vehicle respectively.
  • Figure 4: (a) Sample of a scenario pattern. Three frames are sampled from the video in temporal order, numbered 1 to 3 at the upper left of each frame. (b) Our recording environment map and camera positions.
  • Figure 5: Annotation pipeline for generating the traffic domain-related captions.
  • ...and 4 more figures