A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS

Juan Terven, Diana Cordova-Esparza

TL;DR

This paper provides a thorough, section-by-section survey of the YOLO family from the original YOLOv1 to YOLOv8, YOLO-NAS, and YOLO with Transformers. It analyzes architectural innovations (backbones, necks, heads), training tricks, anchor usage, and NAS-driven designs, while contrasting performance on benchmarks like VOC and COCO and highlighting speed-accuracy tradeoffs. The review also covers ancillary efforts such as PP-YOLO variants, YOLOR, YOLOX, and DAMO-YOLO, illustrating a progression toward anchor-free designs, advanced label assignment, and quantization-aware inference. By synthesizing architectural patterns, training practices, and empirical results, the paper provides guidance for selecting YOLO variants for real-time detection in diverse applications and outlines potential future directions for the evolution of fast and accurate object detectors.

Abstract

YOLO has become a central real-time object detection system for robotics, driverless cars, and video monitoring applications. We present a comprehensive analysis of YOLO's evolution, examining the innovations and contributions in each iteration from the original YOLO up to YOLOv8, YOLO-NAS, and YOLO with Transformers. We start by describing the standard metrics and postprocessing; then, we discuss the major changes in network architecture and training tricks for each model. Finally, we summarize the essential lessons from YOLO's development and provide a perspective on its future, highlighting potential research directions to enhance real-time object detection systems.

Paper Structure (41 sections, 21 figures, 4 tables, 1 algorithm)

This paper contains 41 sections, 21 figures, 4 tables, 1 algorithm.

Figures (21)

  • Figure 1: A timeline of YOLO versions.
  • Figure 2: Bibliometric network visualization of the main YOLO applications, created with VOSviewer [VOSviewer: Visualizing Scientific Landscapes, 2023].
  • Figure 3: Intersection over Union (IoU). a) The IoU is calculated by dividing the intersection of the two boxes by the union of the boxes; b) examples of three different IoU values for different box locations.
  • Figure 4: Non-Maximum Suppression (NMS). a) Shows the typical output of an object detection model containing multiple overlapping boxes. b) Shows the output after NMS.
  • Figure 5: YOLO output prediction. The figure depicts a simplified YOLO model with a three-by-three grid, three classes, and a single class prediction per grid element to produce a vector of eight values.
  • ...and 16 more figures
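
The captions of Figures 3 and 4 describe IoU as the intersection of two boxes divided by their union, and NMS as a step that reduces many overlapping boxes to a few. A minimal Python sketch of both follows. It assumes boxes in `(x1, y1, x2, y2)` corner format, a greedy suppression rule, and a threshold of 0.5. These are illustrative choices and not necessarily the exact procedure given in the paper, and the function names `iou` and `nms` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero so non-overlapping boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    The 0.5 default threshold is an assumption made for this sketch.
    """
    # Visit boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

In this toy input the second box overlaps the first heavily, so only the first and third boxes survive. Production detectors typically apply NMS per class and use vectorized implementations.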