
First qualitative observations on deep learning vision model YOLO and DETR for automated driving in Austria

Stefan Schoder

TL;DR

This paper presents a qualitative study of fast deep-learning vision models for automated driving, comparing YOLO variants (v2, v3, v5, v8) and RT-DETR on US and Austrian road scenes. It highlights the strengths of these models in detecting common objects such as cars, while exposing weaknesses in small-object, traffic-sign, and winter-scene recognition, particularly under alpine conditions and with snow-occluded signs. The work emphasizes the need for region-specific fine-tuning and robust data, and points to potential benefits of multi-modal sensor fusion for safety-critical perception. The findings lay a foundation for subsequent quantitative benchmarks and inform the development of robust, region-aware ADAS and autonomous driving systems on diverse road networks.

Abstract

This study investigates the application of single- and two-stage 2D object detection algorithms, such as You Only Look Once (YOLO) and the Real-Time DEtection TRansformer (RT-DETR), to automated object detection, with the aim of enhancing road safety for autonomous driving on Austrian roads. The YOLO algorithm is a state-of-the-art real-time object detection system known for its efficiency and accuracy. In the context of driving, its ability to rapidly identify and track objects is crucial for advanced driver assistance systems (ADAS) and autonomous vehicles. The research focuses on the unique challenges posed by road conditions and traffic scenarios in Austria. The country's diverse landscape, varying weather conditions, and specific traffic regulations necessitate a tailored approach to reliable object detection. The study uses a selective dataset of images and videos captured on Austrian roads, covering urban, rural, and alpine environments.

Paper Structure (21 sections, 1 equation, 12 figures)


Figures (12)

  • Figure 1: The first-generation YOLO architecture [redmon2016you].
  • Figure 2: "Overview of RT-DETR. We first leverage features of the last three stages of the backbone $\{S_3, S_4, S_5\}$ as the input to the encoder. The efficient hybrid encoder transforms multi-scale features into a sequence of image features through intra-scale feature interaction (AIFI) and cross-scale feature-fusion module (CCFM). The IoU-aware query selection is employed to select a fixed number of image features to serve as initial object queries for the decoder. Finally, the decoder with auxiliary prediction heads iteratively optimizes object queries to generate boxes and confidence scores." [lv2023detrs]
  • Figure 3: Compared to other real-time object detectors, RT-DETR achieves state-of-the-art performance in both speed and accuracy [lv2023detrs].
  • Figure 4: Deep learning vision object detection models applied to one example scene in CA.
  • Figure 5: Deep learning vision object detection models applied to two example scenes in AUT.
  • ...and 7 more figures
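
The Figure 2 caption refers to IoU-aware query selection. As a reminder of the underlying quantity only, the standard intersection over union (IoU) of two axis-aligned boxes can be sketched as below. This is a minimal illustration, not the paper's or RT-DETR's implementation: the corner-based box format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for this example.

```python
from typing import Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap rectangle; clamped to zero when
    # the boxes do not intersect.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield 0.0 rather than a division error.
    return inter / union if union > 0 else 0.0
```

In a qualitative comparison such as the one summarized here, a measure of this kind could be used to judge whether two models have boxed the same object in a scene; whether the study does so is not stated in this summary.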