An Optimized YOLOv5 Based Approach For Real-time Vehicle Detection At Road Intersections Using Fisheye Cameras

Md. Jahin Alam, Muhammad Zubair Hasan, Md Maisoon Rahman, Md Awsafur Rahman, Najibul Haque Sarker, Shariar Azad, Tasnim Nishat Islam, Bishmoy Paul, Tanvir Anjum, Barproda Halder, Shaikh Anowarul Fattah

TL;DR

This work tackles real-time vehicle detection at road intersections using fisheye cameras by adapting YOLOv5 with a dedicated day-night separator, challenging-image upsampling, and a multi-stage transfer learning pipeline. It integrates pseudo labeling and selectively ensembles multiple trained weights to improve localization and reduce false detections under varying lighting and distortion conditions. Evaluations on the VIP Cup 2020 fisheye dataset show a notable $mAP@0.5$ improvement over standard YOLOv5 and demonstrate real-time feasibility, despite ground-truth inconsistencies in the dataset. The approach offers practical benefits for urban traffic surveillance by expanding field-of-view coverage while maintaining accuracy and speed.

Abstract

Real-time vehicle detection is a challenging task for urban traffic surveillance. Increasing urbanization leads to more accidents and traffic congestion at junctions, resulting in longer travel times. To address these problems, an intelligent system with automatic detection and tracking is important. This becomes a challenging task at road intersections, which require a wide field of view. For this reason, fisheye cameras are widely used for real-time vehicle detection, as they provide large area coverage and a 360-degree view at junctions. However, they introduce challenges such as light glare from vehicles and street lights, shadows, non-linear distortion, scale variation of vehicles, and difficulty in localizing small vehicles. To overcome these challenges, a modified YOLOv5 object detection scheme is proposed. YOLOv5 is a deep-learning object detection method based on convolutional neural networks (CNNs). The proposed scheme for detecting vehicles in fisheye images includes a lightweight day-night CNN classifier, so that two different solutions can be applied to the day and night detection problems. Furthermore, challenging instances are upsampled in the dataset to improve vehicle localization, and the detection model is then trained on different combinations of vehicle datasets and ensembled for better generalization, detection, and accuracy. For testing, a real-world fisheye dataset provided by the Video and Image Processing (VIP) Cup organizer ISSD is used; it includes images from video clips of different fisheye cameras at junctions in different cities during day and night. Experimental results show that the proposed model outperforms the YOLOv5 model on this dataset by 13.7% mAP @ 0.5.
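The reported metric, mAP @ 0.5, counts a predicted box as a correct detection only when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below illustrates that matching criterion only, not the full mAP computation (which also averages precision over recall levels and classes). The function names `iou` and `is_match` and the `(x1, y1, x2, y2)` corner format are assumptions made for illustration; they do not come from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) corner tuples; this format is an
    assumption for the sketch, not the paper's annotation format.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count
    as a true positive at the given IoU threshold (0.5 for mAP @ 0.5)."""
    return iou(pred_box, gt_box) >= threshold
```

For example, a prediction covering `(0, 0, 10, 10)` against a ground-truth box `(0, 0, 10, 8)` has an IoU of 0.8 and would count as a match, while two boxes that overlap only at a corner typically fall below the threshold.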

Paper Structure

This paper contains 32 sections, 2 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Graphical abstract of the proposed scheme: The challenging subset of images is up-sampled in number and combined with the rest before creating data partitions (1, 2, 3, 4, 5). From these, sequential partitions are defined and each sequence is passed to the implemented pipeline, which generalizes the weights of the core model. The scores or evaluation metrics produced by each sequence are examined carefully, and the best ones are selected for the ensemble.
  • Figure 2: Backbone architecture of the scheme (a). The input image is passed through a sequential pipeline consisting of Focus (b), Conv Blocks (c), BottleNeck CSP (d), and SPP blocks (e), from which three feature maps (A, B, and C) are extracted and passed to the next processing stage, the Neck.
  • Figure 3: Implemented Core Model
  • Figure 4: The Neck Architecture: Path Aggregation Network (PANet)
  • Figure 5: Day-Night Separator Model
  • ...and 6 more figures
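
The data-preparation step described in the Figure 1 caption (up-sampling the challenging subset, combining it with the rest, and creating five partitions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the function name `upsample_and_partition`, the replication `factor`, the shuffle, and the round-robin split are all hypothetical choices, since the caption does not specify how many times challenging images are repeated or how partitions are drawn.

```python
import random


def upsample_and_partition(regular, challenging, factor=3, n_partitions=5, seed=0):
    """Replicate challenging samples, merge with the rest, and split.

    regular, challenging: lists of sample identifiers (e.g. image paths).
    factor: how many copies of each challenging sample to keep
            (hypothetical value; not taken from the paper).
    n_partitions: number of partitions (the caption lists 1 to 5).
    """
    if factor < 1 or n_partitions < 1:
        raise ValueError("factor and n_partitions must be at least 1")

    # Up-sample by simple repetition, then combine with the regular images.
    pool = list(regular) + list(challenging) * factor

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(pool)

    # Round-robin split into roughly equal partitions (an assumed strategy).
    return [pool[i::n_partitions] for i in range(n_partitions)]
```

Each returned partition could then be combined into training sequences and passed to a training pipeline; that part is left out here because the caption gives no detail on how the sequences are formed.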