A Benchmark for Cycling Close Pass Detection from Video Streams

Mingjie Li, Ben Beck, Tharindu Rathnayake, Lingheng Meng, Zijue Chen, Akansel Cosgun, Xiaojun Chang, Dana Kulić

TL;DR

This work introduces Cyc-CP, a benchmark for detecting cycling close passes from video streams, and defines two CP detection tasks: scene-level (clip-level presence) and instance-level (which vehicle causes the CP). It combines a synthetic CARLA dataset with real-world VOC data and evaluates four benchmark models, including traditional video architectures (I3D, CNN+LSTM) and a monocular 3D detector-based framework (ICD), with additional exploration of a large multimodal model (InternVideo 2.5) via prompts. On the real-world VOC data, scene-level and instance-level detections achieve $88.13\%$ and $84.60\%$ accuracy, respectively, while experiments show that RGB-only inputs generally outperform optical-flow-enhanced configurations and that alternating or finetuning strategies improve instance-level performance. The benchmark is released openly to accelerate CP detection research and inform road safety policy, with future work aiming to extend beyond CP events, incorporate additional sensors, and enhance data diversity and robustness.

Abstract

Cycling is a healthy and sustainable mode of transport. However, interactions with motor vehicles remain a key barrier to increased cycling participation. The ability to detect potentially dangerous interactions from on-bike sensing could provide important information to riders and policymakers. A key influence on rider comfort and safety is close passes, i.e., when a vehicle narrowly passes a cyclist. In this paper, we introduce a novel benchmark, called Cyc-CP, for close pass (CP) event detection from video streams. The task is formulated as two problem categories: scene-level and instance-level. Scene-level detection determines whether a CP event is present in a given video clip. Instance-level detection identifies the specific vehicle in the scene that causes the CP event. To address these challenges, we introduce four benchmark models, each built on advanced deep-learning methods. To train and evaluate these models, we developed a synthetic dataset and collected a real-world dataset. The benchmark evaluations show that the models achieve an accuracy of 88.13% for scene-level detection and 84.60% for instance-level detection on the real-world dataset. We envision this benchmark as a test-bed to accelerate CP detection and facilitate interaction between the fields of road safety, intelligent transportation systems and artificial intelligence. Both the benchmark datasets and detection models will be available at https://github.com/SustainableMobility/cyc-cp to facilitate experimental reproducibility and encourage more in-depth research in the field.

Paper Structure

This paper contains 31 sections, 8 equations, 19 figures, and 3 tables.

Figures (19)

  • Figure 1: Three samples of CP events extracted from the Victorian On-road Cycling Dataset. Video recordings capturing CP events invariably capture additional objects, such as pedestrians, vehicles, and obstacles.
  • Figure 2: Instance-level CP detection involves the detection of objects and subsequent prediction of their attributes such as actual sizes, categories, and 3D locations. CP events are then recognized based on these predicted attributes. The detected objects are displayed as red boxes in a bird's eye view, with the cyclist represented as a green circle.
  • Figure 3: Architectures of the benchmark models for scene-level CP detection.
  • Figure 4: Illustration of adapting a large multi-modal model for scene-level CP detection. We asked the model to identify a CP event and explain its choice step by step.
  • Figure 5: Comparison of videos from nuScenes driving scenes and our Victorian On-road Cycling scenes. Videos recorded by forward-facing cameras mounted on bicycles are usually wobbly and truncated, whereas driving scenes are more stable.
  • ...and 14 more figures