Detection of Micromobility Vehicles in Urban Traffic Videos

Khalil Sabri, Célia Djilali, Guillaume-Alexandre Bilodeau, Nicolas Saunier, Wassim Bouachir

TL;DR

This work introduces an adapted detection model that combines the accuracy and speed of single-frame object detection with the richer features offered by video object detection frameworks. It does so by aligning feature maps from consecutive frames using motion flow, aggregating them, and feeding the result to the YOLOX architecture.

Abstract

Urban traffic environments present unique challenges for object detection, particularly with the increasing presence of micromobility vehicles like e-scooters and bikes. To address this object detection problem, this work introduces an adapted detection model that combines the accuracy and speed of single-frame object detection with the richer features offered by video object detection frameworks. This is done by applying aggregated feature maps from consecutive frames processed through motion flow to the YOLOX architecture. This fusion brings a temporal perspective to YOLOX detection abilities, allowing for a better understanding of urban mobility patterns and substantially improving detection reliability. Tested on a custom dataset curated for urban micromobility scenarios, our model showcases substantial improvement over existing state-of-the-art methods, demonstrating the need to consider spatio-temporal information for detecting such small and thin objects. Our approach enhances detection in challenging conditions, including occlusions, ensuring temporal consistency, and effectively mitigating motion blur.

Paper Structure

This paper contains 15 sections, 9 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: FGFA-YOLOX detection framework overview. Features from input frames are first extracted with the backbone of YOLOX. The optical flows between the current frame and past and future frames (neighbour frames) are also computed. For temporal aggregation, the motion-adjusted features of the neighbour frames are aggregated with those of the current frame. The aggregated features are then processed through the neck and head of the YOLOX architecture for detection.
  • Figure 2: Examples of annotated images in PolyMMV. Class 0: Bicycles, Class 1: Skateboards, Class 2: Electric Scooters
  • Figure 3: Characteristics of the PolyMMV training dataset. Top left: distribution of the number of instances of bicycles, skateboards, and e-scooters. Top right: size distribution of the bounding boxes. Bottom left and right: scatter plots of normalized bounding box positions and sizes, respectively.
  • Figure 4: Comparative analysis of model performance in various scenarios. A) Occlusion, B) Motion blur, and C) Temporal consistency.
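
The temporal aggregation step described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `warp_nearest` and `aggregate_features`, the nearest-neighbour sampling, the flow layout (channel 0 horizontal, channel 1 vertical), and the cosine-similarity softmax weights computed directly on the raw features are all assumptions made for illustration, loosely following the general idea of flow-guided feature aggregation. In the framework described above, the feature maps would come from the YOLOX backbone and the aggregated result would be passed to the YOLOX neck and head.

```python
import numpy as np


def warp_nearest(feat, flow):
    """Align a neighbour feature map (C, H, W) with the current frame.

    Assumed convention: flow has shape (2, H, W); flow[0] is the horizontal
    and flow[1] the vertical displacement from a current-frame location to
    the matching location in the neighbour frame. Sampling is
    nearest-neighbour with coordinates clamped to the border.
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return feat[:, src_y, src_x]


def aggregate_features(current, neighbours, flows, eps=1e-8):
    """Weighted sum of the current features and the warped neighbour features.

    Weights (an assumption of this sketch): per-location cosine similarity
    between each aligned map and the current map, normalised over frames
    with a softmax.
    """
    warped = [current] + [warp_nearest(f, fl) for f, fl in zip(neighbours, flows)]
    stack = np.stack(warped)                                  # (T, C, H, W)
    dot = (stack * current[None]).sum(axis=1)                 # (T, H, W)
    norms = np.linalg.norm(stack, axis=1) * np.linalg.norm(current, axis=0)[None]
    sim = dot / (norms + eps)
    weights = np.exp(sim - sim.max(axis=0, keepdims=True))    # stable softmax
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights[:, None] * stack).sum(axis=0)             # (C, H, W)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    current = rng.normal(size=(4, 6, 8))        # stand-in for backbone features
    past = rng.normal(size=(4, 6, 8))
    future = rng.normal(size=(4, 6, 8))
    flows = [np.zeros((2, 6, 8)), np.zeros((2, 6, 8))]  # stand-in optical flows
    fused = aggregate_features(current, [past, future], flows)
    print(fused.shape)
```

Because the weights are computed per spatial location, a neighbour frame contributes most where its aligned features agree with the current frame and is down-weighted where they differ, for example where the object is occluded or blurred in that neighbour.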