YOLOv1 to YOLOv10: The fastest and most accurate real-time object detection systems

Chien-Yao Wang, Hong-Yuan Mark Liao

TL;DR

This survey traces the decade-long evolution of the YOLO family from YOLOv1 to YOLOv10, highlighting core design philosophies that enabled real-time, edge-friendly object detection. It analyzes architectural innovations, training techniques, and label-assignment strategies that yielded increasing speed and accuracy, while also enabling cross-domain extensions to tracking, segmentation, driving, pose, 3D perception, and open-vocabulary tasks. The paper underscores YOLO's influence on subsequent CV research and its role as a versatile platform for integrating with transformers, NAS, multimodal models, and lightweight hardware-focused designs. It provides a structured view of how simpler, faster, and stronger YOLO variants have driven practical deployment and inspired broader developments in computer vision and language-model-enabled perception.

Abstract

This is a comprehensive review of the YOLO series of systems. Unlike previous literature surveys, this review article re-examines the characteristics of the YOLO series from the latest technical point of view. We also analyze how the YOLO series has continued to influence and promote real-time computer vision research and has led to the subsequent development of computer vision and language models. We take a closer look at how the methods proposed by the YOLO series over the past ten years have affected the development of subsequent technologies, and we show the applications of YOLO in various fields. We hope this article can serve as a useful guide for subsequent real-time computer vision development.

Paper Structure (37 sections, 3 equations, 14 figures)

Figures (14)

  • Figure 1: Architecture of YOLOv1.
  • Figure 2: Architecture of YOLOv2.
  • Figure 3: Architecture of YOLOv3, YOLOv5, and PP-YOLO.
  • Figure 4: Architecture of Gaussian YOLOv3.
  • Figure 5: Architecture of YOLOv4, Scaled-YOLOv4, YOLOv5 r1--r7, and PP-YOLOv2.
  • ...and 9 more figures