CycleCrash: A Dataset of Bicycle Collision Videos for Collision Prediction and Analysis

Nishq Poorav Desai, Ali Etemad, Michael Greenspan

TL;DR

The paper proposes VidNeXt, a novel method that combines a ConvNeXt spatial encoder with a non-stationary transformer to capture the temporal dynamics of videos for the tasks defined in the authors' dataset.

Abstract

Self-driving research often underrepresents cyclist collisions and safety. To address this, we present CycleCrash, a novel dataset consisting of 3,000 dashcam videos with 436,347 frames that capture cyclists in a range of critical situations, from collisions to safe interactions. This dataset enables 9 different cyclist collision prediction and classification tasks focusing on potentially hazardous conditions for cyclists and is annotated with collision-related, cyclist-related, and scene-related labels. Next, we propose VidNeXt, a novel method that leverages a ConvNeXt spatial encoder and a non-stationary transformer to capture the temporal dynamics of videos for the tasks defined in our dataset. To demonstrate the effectiveness of our method and create additional baselines on CycleCrash, we apply and compare 7 models along with a detailed ablation. We release the dataset and code at https://github.com/DeSinister/CycleCrash/ .

Paper Structure (12 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: A few samples from CycleCrash showcasing various cyclist-related interactions with different vehicles along with collision severity levels.
  • Figure 2: Sample frames from 3 video clips along with descriptions and annotations from the CycleCrash dataset.
  • Figure 3: Distribution of CycleCrash data for (i) time-to-collision, (ii) duration of video clips, (iii) other objects involved, (iv) behaviour risk index, (v) age, and (vi) fault.
  • Figure 4: Relationship between (i) direction of cyclists and objects involved in collisions, (ii) fault and age.
  • Figure 5: (a) The architecture of the proposed method, VidNeXt. First, ConvNeXt encodes the video frames. Next, the frame embeddings are normalized for stationarity. Non-stationary information is then reintegrated into the transformer blocks via rescaling factors $\tau$ and $\Delta$, which the projector determines from the statistics of the preceding normalization layer. (b) Architectural details of the transformer with de-stationary attention.
  • ...and 8 more figures
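
The Figure 5 caption describes normalizing frame embeddings and then reintroducing non-stationary information through $\tau$ and $\Delta$. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the attention takes the form softmax((τ·QKᵀ + Δ)/√d)·V, as commonly described for non-stationary transformers. The function names, the toy embeddings, and the hand-picked values of `tau` and `delta` are all hypothetical; the ConvNeXt encoder and the learned projector are omitted.

```python
import math

def stationarize(frames):
    """Normalize each embedding feature across time (zero mean, unit variance).

    Returns the normalized sequence plus the per-feature mean and std, which
    a projector (omitted here) would use to produce tau and delta.
    """
    num_frames, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / num_frames + 1e-5)
        for d in range(dim)
    ]
    normed = [[(f[d] - mean[d]) / std[d] for d in range(dim)] for f in frames]
    return normed, mean, std

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def destationary_attention(q, k, v, tau, delta):
    """Assumed form: softmax((tau * Q K^T + delta) / sqrt(d)) V.

    tau is a positive scalar; delta holds one shift per key position.
    """
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [
            (tau * sum(a * b for a, b in zip(q_row, k_row)) + delta[j]) / scale
            for j, k_row in enumerate(k)
        ]
        weights = softmax(scores)
        out.append(
            [sum(weights[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return out

# Toy stand-in for three frame embeddings from a spatial encoder (hypothetical values).
embeddings = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
normed, mean, std = stationarize(embeddings)

# Hand-picked rescaling factors; in the described method a projector predicts them.
tau, delta = 1.5, [0.0, 0.0, 0.0]
attended = destationary_attention(normed, normed, normed, tau, delta)
```

Setting `tau` to 0 and `delta` to zeros makes every attention weight equal, so each output row collapses to the average of the value rows. This is a quick way to see that `tau` controls how sharply attention concentrates on individual frames.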