DAVE: Diverse Atomic Visual Elements Dataset with High Representation of Vulnerable Road Users in Complex and Unpredictable Environments

Xijun Wang, Pedro Sandoval-Segura, Chengyuan Zhang, Junyun Huang, Tianrui Guan, Ruiqi Xian, Fuxiao Liu, Rohan Chandra, Boqing Gong, Dinesh Manocha

TL;DR

DAVE tackles the gap between Western, structured traffic datasets and real-world, unstructured Asian traffic by providing a large, richly annotated dataset focused on Vulnerable Road Users (VRUs) gathered in India. It features 16 actor categories and 16 action types, with over 13 million bounding boxes and 1.6 million actor-action annotations, enabling comprehensive evaluation across Tracking, Detection, Video Moment Retrieval, Spatiotemporal Action Localization, and Multi-label Video Action Recognition. Across tasks, DAVE proves more challenging than existing datasets, revealing degradation of state-of-the-art methods and highlighting the need for diverse, global data to improve robustness in complex driving environments. The dataset’s breadth, dense annotations, and variability support more sensitive perception algorithms that generalize to real-world, high-density VRU interactions, with clear implications for safety-critical autonomous driving systems.

Abstract

Most existing traffic video datasets, including Waymo, are structured and focus predominantly on Western traffic, which hinders global applicability. Most Asian scenarios are far more complex, involving numerous objects with distinct motions and behaviors. Addressing this gap, we present a new dataset, DAVE, designed for evaluating perception methods with high representation of Vulnerable Road Users (VRUs, e.g. pedestrians, animals, motorbikes, and bicycles) in complex and unpredictable environments. DAVE is a manually annotated dataset encompassing 16 diverse actor categories (spanning animals, humans, vehicles, etc.) and 16 action types (including complex and rare cases such as cut-ins, zigzag movement, and U-turns), which require high reasoning ability. DAVE densely annotates over 13 million bounding boxes (bboxes) of actors with identification, and more than 1.6 million of these boxes are annotated with both actor identification and action/behavior details. The videos in DAVE are collected to cover a broad spectrum of factors, such as weather conditions, time of day, road scenarios, and traffic density. DAVE can benchmark video tasks such as Tracking, Detection, Spatiotemporal Action Localization, Language-Visual Moment Retrieval, and Multi-label Video Action Recognition. Accurately identifying VRUs is critical for preventing accidents and ensuring road safety; in DAVE, vulnerable road users constitute 41.13% of instances, compared to 23.71% in Waymo. DAVE provides an invaluable resource for developing more sensitive and accurate visual perception algorithms for the complex real world. Our experiments show that existing methods suffer degraded performance when evaluated on DAVE, highlighting its benefit for future video recognition research.
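
To make the VRU-share figure concrete, the following is a minimal sketch of how such a percentage could be computed from per-box annotations. It is not DAVE's actual annotation format or tooling: the `BoxAnnotation` fields, the category label strings, the `(x, y, w, h)` box layout, and the choice to count every annotated box as one instance are all assumptions made for illustration, and the toy records are invented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical label strings. The abstract names pedestrians, animals,
# motorbikes, and bicycles as examples of VRUs; the real label set may differ.
VRU_CATEGORIES = {"pedestrian", "animal", "motorbike", "bicycle"}


@dataclass(frozen=True)
class BoxAnnotation:
    """One annotated box (assumed schema, not the dataset's real one)."""
    frame: int
    track_id: int                              # actor identification
    category: str                              # one of the actor categories
    bbox: tuple[float, float, float, float]    # (x, y, w, h), assumed layout
    action: Optional[str] = None               # present only for action-labeled boxes


def vru_share(annotations: Iterable[BoxAnnotation]) -> float:
    """Return the percentage of boxes whose category is a VRU category."""
    counts = Counter(a.category for a in annotations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    vru = sum(n for category, n in counts.items() if category in VRU_CATEGORIES)
    return 100.0 * vru / total


# Invented toy records, only to exercise the function.
toy = [
    BoxAnnotation(0, 1, "pedestrian", (10, 20, 30, 60)),
    BoxAnnotation(0, 2, "car", (100, 40, 80, 50)),
    BoxAnnotation(0, 3, "motorbike", (200, 60, 40, 40), action="cut-in"),
    BoxAnnotation(1, 2, "car", (104, 40, 80, 50), action="U-turn"),
]
print(f"VRU share: {vru_share(toy):.2f}%")
```

Counting per box weights long-lived tracks more heavily than counting per unique actor; which definition underlies the reported 41.13% is not stated in the abstract.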

Paper Structure (39 sections, 3 figures, 9 tables)

Figures (3)

  • Figure 1: Tasks Overview. We use DAVE for various video recognition tasks, including Tracking, Detection, Video Moment Retrieval, Spatiotemporal Action Localization, and Multi-label Video Action Recognition. Our large-scale dataset is made up of complex environments that are densely annotated. Each bounding box (bbox) corresponds to an actor, and the text above each bbox serves as either the tracking ID or indicates the associated action.
  • Figure 2: Challenging Characteristics of DAVE: These videos correspond to different times of the day with different brightness, different geographical landforms from city and rural areas, high density and unpredictable road conditions, diverse actors including humans, animals, vehicles, etc.
  • Figure 3: Annotation Statistics. The actor and action distribution for DAVE, which includes a wide-ranging and rich taxonomy of 16 agent and 16 action categories. This dual focus on both the breadth of agent and action types and the depth of instances allows for more robust and effective training of video recognition models.