YOLOv12: Attention-Centric Real-Time Object Detectors

Yunjie Tian, Qixiang Ye, David Doermann

TL;DR

This work tackles the tension between speed and accuracy in real-time object detection by infusing YOLO with attention while preserving CNN-style efficiency. It introduces Area Attention (A2) to reduce self-attention cost, Residual Efficient Layer Aggregation Networks (R-ELAN) to stabilize optimization, and targeted architectural refinements (e.g., removing positional encodings, adjusting MLP ratio) to fit the YOLO pipeline. Through extensive COCO experiments across five model scales, YOLOv12 achieves state-of-the-art latency-accuracy trade-offs, outperforming existing real-time detectors and RT-DETR variants on multiple metrics and hardware configurations. The results demonstrate that attention-centric designs can meet stringent real-time requirements without pretraining, signaling a path for future high-performance, efficient detectors.

Abstract

Improving the network architecture of the YOLO framework has long been a priority, but efforts have focused on CNN-based designs despite the proven superiority of attention mechanisms in modeling capability. The reason is that attention-based models have not matched the speed of CNN-based ones. This paper proposes an attention-centric YOLO framework, YOLOv12, that matches the speed of previous CNN-based versions while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming the advanced YOLOv10-N / YOLOv11-N by 2.1% / 1.2% mAP at comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.

Paper Structure

This paper contains 16 sections, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparisons with other popular methods in terms of latency-accuracy (left) and FLOPs-accuracy (right) trade-offs.
  • Figure 2: Comparison of representative local attention mechanisms with our area attention. Area attention uses the most straightforward equal partitioning, dividing the feature map into $l$ areas vertically or horizontally (default $l = 4$). This avoids complex operations while ensuring a large receptive field, resulting in high efficiency.
  • Figure 3: Architecture comparison with popular modules: (a) CSPNet [wang2020cspnet], (b) ELAN [wang2022designing_elan], (c) C3K2 (a case of GELAN) [wang2024yolov9, jocher2024yolov11], and (d) the proposed R-ELAN (residual efficient layer aggregation networks).
  • Figure 4: Comparison with popular methods in terms of accuracy-parameters (left) and accuracy-latency trade-off on CPU (right).
  • Figure 5: Comparison of heat maps between YOLOv10 [wang2024yolov10], YOLOv11 [jocher2024yolov11], and the proposed YOLOv12. Compared to the advanced YOLOv10 and YOLOv11, YOLOv12 demonstrates a clearer perception of objects in the image. All results are obtained using the X-scale models. Zoom in to compare the details.
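
The equal partitioning described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single attention head, identity query/key/value projections, and horizontal strips only. The names `attend`, `area_attention`, and `score_count` are hypothetical helpers introduced here for illustration. The point it shows is that confining attention to $l$ equal areas of $n/l$ tokens each cuts the number of query-key scores from $n^2$ to $n^2/l$.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Plain self-attention over a list of C-dim tokens.

    Assumption: Q, K and V are the tokens themselves (no learned projections).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out


def area_attention(fmap, num_areas=4):
    """Apply attention independently inside equal horizontal strips.

    fmap is an H x W x C nested list; H must be divisible by num_areas.
    """
    height, width = len(fmap), len(fmap[0])
    if height % num_areas:
        raise ValueError("height must be divisible by num_areas")
    rows_per_area = height // num_areas
    out = []
    for start in range(0, height, rows_per_area):
        strip = fmap[start:start + rows_per_area]
        tokens = [tok for row in strip for tok in row]  # flatten strip to tokens
        attended = attend(tokens)
        for r in range(rows_per_area):  # restore the row layout of the strip
            out.append(attended[r * width:(r + 1) * width])
    return out


def score_count(n_tokens, num_areas):
    """Query-key scores computed when attention is confined to equal areas."""
    return num_areas * (n_tokens // num_areas) ** 2
```

For an 8 x 8 feature map (64 tokens), `score_count(64, 1)` gives 4096 scores for global attention and `score_count(64, 4)` gives 1024 with the default of four areas. Each output token depends only on tokens in its own strip, which is the trade-off this partitioning makes for the lower cost.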