
Combining YOLO and Visual Rhythm for Vehicle Counting

Victor Nascimento Ribeiro, Nina S. T. Hirata

TL;DR

This work tackles efficient vehicle counting in videos captured by static cameras. It eliminates the traditional tracking stage and instead uses Visual Rhythm (VR) to produce time-spatial images that indicate which frames contain useful information. A YOLO-based detector finds the marks that vehicles leave in these VR images; the corresponding frames are then extracted and verified to map each mark to its vehicle, with a mechanism that prevents double counting across VR segments. The approach achieves a mean counting accuracy of about $99.15\%$ and runs roughly three times faster than frame-by-frame tracking methods on segments of $T=900$ frames, an efficiency gain with minimal loss in accuracy. While the method is not real-time, it provides a practical framework for unidirectional vehicle counting and can benefit from transfer learning and future extensions to more vehicle classes.
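To make the idea of a time-spatial image concrete, the following is a minimal sketch of how a Visual Rhythm image could be assembled: one fixed pixel row is taken from every frame and the rows are stacked in time order, one image per segment of frames. The function names, the choice of a horizontal counting line `line_y`, and the synthetic input are assumptions made for illustration; the summary does not specify the authors' implementation.

```python
import numpy as np


def visual_rhythm(frames: np.ndarray, line_y: int) -> np.ndarray:
    """Stack the pixel row `line_y` of every frame into one image.

    frames: array of shape (T, H, W) for grayscale or (T, H, W, C) for color.
    Returns an array of shape (T, W) or (T, W, C); row t comes from frame t.
    """
    return frames[:, line_y].copy()


def segment_rhythms(frames: np.ndarray, line_y: int, segment_len: int = 900):
    """Yield (segment_index, VR image) for consecutive blocks of frames.

    segment_len=900 mirrors the T=900 mentioned in the summary; the last
    block may be shorter if the video length is not a multiple of it.
    """
    for k, start in enumerate(range(0, len(frames), segment_len)):
        yield k, visual_rhythm(frames[start:start + segment_len], line_y)


# Synthetic stand-in for a video: 10 grayscale frames of 4x6 pixels,
# where frame t is filled with the value t (hypothetical data).
frames = np.stack([np.full((4, 6), t, dtype=np.uint8) for t in range(10)])

vr = visual_rhythm(frames, line_y=2)
print(vr.shape)  # (10, 6): one row per frame, one column per pixel on the line

segments = list(segment_rhythms(frames, line_y=2, segment_len=4))
print([img.shape for _, img in segments])  # [(4, 6), (4, 6), (2, 6)]
```

In this layout a vehicle crossing the line shows up as a compact mark spanning the rows (frames) during which it covers the line, which is why a detector can be run on the VR image instead of on every frame.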

Abstract

Video-based vehicle detection and counting play a critical role in managing transport infrastructure. Traditional image-based counting methods usually involve two main steps, initial detection and subsequent tracking, which are applied to all video frames and significantly increase computational complexity. To address this issue, this work presents an alternative and more efficient method for vehicle detection and counting. The proposed approach eliminates the tracking step and detects vehicles only in key video frames, thereby increasing efficiency. To achieve this, we developed a system that combines YOLO, for vehicle detection, with Visual Rhythm, a way to create time-spatial images that allows us to focus on frames that contain useful information. Additionally, this method can be used for counting in any application involving unidirectional moving targets to be detected and identified. Experimental analysis using real videos shows that the proposed method achieves a mean counting accuracy of around 99.15% over a set of videos, with a processing speed three times faster than tracking-based approaches.
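The step of going from a mark in a Visual Rhythm image back to a key video frame can be illustrated with a small sketch. It assumes that each row of a VR image corresponds to one frame of its segment and that the frame at the vertical center of a mark is the one to extract; both are illustrative assumptions, as are the `Mark` type and the example boxes, which stand in for real detector output.

```python
from dataclasses import dataclass


@dataclass
class Mark:
    """Hypothetical detector output: a bounding box in VR-image coordinates."""
    x_min: int
    y_min: int  # first VR row (time step) covered by the mark
    x_max: int
    y_max: int  # last VR row (time step) covered by the mark


def key_frame_index(mark: Mark, segment_index: int, segment_len: int = 900) -> int:
    """Map a mark to a frame index in the full video.

    Assumes row r of segment k corresponds to frame k * segment_len + r,
    and picks the row at the vertical center of the mark (an assumption).
    """
    row = (mark.y_min + mark.y_max) // 2
    return segment_index * segment_len + row


# Two made-up marks found in the VR image of segment 2 (frames 1800..2699).
marks = [Mark(40, 100, 120, 160), Mark(200, 500, 290, 540)]
key_frames = [key_frame_index(m, segment_index=2) for m in marks]
print(key_frames)  # [1930, 2320]
```

Only these key frames would then be passed to the vehicle detector, which is where the saving over processing every frame comes from.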

Paper Structure (10 sections, 4 figures, 3 tables)

This paper contains 10 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Visual Rhythm Generation.
  • Figure 2: Data flow in the VR-based vehicle counting method.
  • Figure 3: Vehicle represented by distinct marks in two consecutive VR images.
  • Figure 4: Image samples from the dataset.