RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee

TL;DR

This study benchmarks transformer-based RF-DETR against CNN-based YOLOv12 for greenfruit detection in complex orchards with occlusion and label ambiguity. Using a field-collected dataset of 857 RGB images annotated for single-class and multi-class scenarios, RF-DETR delivers the highest $mAP@50$ in single-class detection ($0.9464$) and strong multi-class performance ($mAP@50=0.8298$), while YOLOv12 variants achieve competitive $mAP@50:95$ results, notably $0.7620$ for YOLOv12N. Training dynamics reveal rapid convergence for RF-DETR (under 10 epochs for single-class; ~20 for multi-class) compared with the full 100 epochs required by YOLOv12 variants, highlighting transformer efficiency in dynamic visual data. The results suggest RF-DETR is better suited for precise localization in cluttered scenes, whereas YOLOv12 is advantageous for fast, edge-embedded deployment, offering a practical guide for precision agriculture deployments under label ambiguity and occlusion.

Abstract

This study conducts a detailed comparison of the RF-DETR base object detection model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, which uses a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision at an IoU threshold of 0.50 (mAP@50) in single-class detection, 0.9464, indicating a superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios.

Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
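
The abstract reports both mAP@50 and mAP@50:95 without defining them here. As a minimal sketch, assuming the common COCO-style convention (a detection counts as correct at mAP@50 when its IoU with a ground-truth box is at least 0.50, while mAP@50:95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05), the snippet below shows why one predicted box can pass the loose threshold and fail the stricter ones. The `(x1, y1, x2, y2)` box format, the function names, and the example boxes are illustrative assumptions, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Return the IoU thresholds at which pred_box would match gt_box."""
    overlap = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if overlap >= t]


# Hypothetical example: a prediction shifted by 10 px in both directions.
gt = (0, 0, 100, 100)
pred = (10, 10, 110, 110)
print(round(iou(pred, gt), 4))      # 0.6807
print(thresholds_passed(pred, gt))  # [0.5, 0.55, 0.6, 0.65]
```

Under this convention, a model can lead on mAP@50 while trailing on mAP@50:95 if its boxes are usually present but less tightly aligned, which is one way to read the split between RF-DETR and the YOLOv12 variants reported above.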

Paper Structure

This paper contains 18 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Classification of object detection methodologies: Top features state-of-the-art CNN-based and Transformer-based methods, widely adopted; Vision Language Models are emerging. Also includes Hybrid, Sparse Coding, and Traditional Feature-based approaches.
  • Figure 2: CNN vs Transformer-based model performance comparison focusing on YOLOv12 (CNN-based) and RF-DETR (Transformer-based) architectures: (a) RF-DETR object detection model benchmark evaluation with YOLO11, YOLOv8 and other DETR-based object detection models; (b) RF-DETR evaluation on the RF100-VL dataset, highlighting domain adaptability and edge deployment potential; and (c) Performance overview of recent CNN-based models, including YOLOv6 through YOLOv12, Gold-YOLO, RT-DETR, RT-DETRv2, and YOLO-MS. b) RF-DETR benchmark results on the MS COCO dataset, surpassing 60% mAP
  • Figure 3: Overview of data collection setup and environment: a) Flow diagram showing the methodology of the RF-DETR vs YOLOv12 comparison; b) Map highlighting the study location in Prosser, Washington, USA; c) of 'Scifresh' apple trees, known as Jazz apples; d) The robotic platform used for image acquisition, featuring an Intel RGB-D camera mounted on a UR5e robotic arm, capturing images of immature greenfruits in a complex orchard environment.
  • Figure 4: (a) RF-DETR architecture diagram for object detection; (b) YOLOv12 architecture diagram for object detection
  • Figure 5: Visual comparison of single-class greenfruit detection using RF-DETR and YOLOv12 in complex orchard scenes. a) Three clustered greenfruits partially occluded by dense canopy; RF-DETR detected all, YOLOv12 missed one. b) A camouflaged greenfruit blending into the canopy; RF-DETR correctly detected it, YOLOv12 failed. c) A heavily occluded greenfruit with only the calyx visible under low light; RF-DETR identified it, YOLOv12 missed it.
  • ...and 3 more figures