Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10

Ranjan Sapkota, Manoj Karkee

TL;DR

This work addresses the need for scalable, robust apple detection in orchards by training YOLOv12 exclusively on synthetic images generated with Large Language Models (LLMs), then benchmarking it against YOLOv11 and YOLOv10. The authors combine LLM-generated datasets with YOLOv12's architectural innovations (Area Attention, Residual ELAN, 7×7 depth-wise convolution) and show that YOLOv12n achieves the best metrics on synthetic data (precision 0.916, recall 0.969, mAP@50 0.978). Field validation with real orchard images demonstrates strong generalization, with YOLOv12n maintaining superior accuracy relative to its predecessors. On speed the picture is mixed: YOLOv12n (5.6 ms) is faster than YOLOv10n (5.9 ms) but slower than YOLOv11n (4.7 ms). Overall, the study demonstrates that synthetic data can replace extensive field data collection for training high-performing agricultural detectors, enabling scalable, real-time apple detection in commercial operations.

Abstract

This study evaluated the YOLOv12 object detection model and compared it against YOLOv11 and YOLOv10 for apple detection in commercial orchards, with all models trained entirely on synthetic images generated by Large Language Models (LLMs). The YOLOv12n configuration achieved the highest precision (0.916), the highest recall (0.969), and the highest mean Average Precision (mAP@50) at 0.978. In comparison, the YOLOv11 series was led by YOLOv11x, which achieved a precision of 0.857, a recall of 0.85, and an mAP@50 of 0.91. In the YOLOv10 series, YOLOv10b and YOLOv10l both achieved the highest precision at 0.85, while YOLOv10n achieved the highest recall at 0.8 and the highest mAP@50 at 0.89. These findings demonstrate that YOLOv12, when trained on realistic LLM-generated datasets, surpassed its predecessors in key performance metrics. The technique also offers a cost-effective solution by reducing the need for extensive manual data collection in the field. In addition, the study compared the computational efficiency of all YOLOv12, YOLOv11, and YOLOv10 configurations: YOLOv11n reported the lowest inference time at 4.7 ms, compared to 5.6 ms for YOLOv12n and 5.9 ms for YOLOv10n. Although YOLOv12 is newer and more accurate than YOLOv11 and YOLOv10, YOLOv11n remains the fastest model across the three series. (Index terms: YOLOv12, YOLOv11, YOLOv10, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO Object detection)
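To put the reported numbers on a common footing, the short sketch below derives two secondary quantities from the figures quoted in the abstract: the F1 score implied by YOLOv12n's precision and recall, and the throughput implied by each nano model's inference time. It is an illustration rather than the authors' evaluation code. It assumes that the quoted precision and recall were measured at the same confidence threshold and that inference time excludes pre- and post-processing; the helper names `f1_score` and `frames_per_second` are hypothetical.

```python
# Values copied from the abstract; everything derived from them is illustrative.
INFERENCE_MS = {"YOLOv11n": 4.7, "YOLOv12n": 5.6, "YOLOv10n": 5.9}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def frames_per_second(latency_ms: float) -> float:
    """Throughput implied by a per-image latency given in milliseconds."""
    return 1000.0 / latency_ms


# Assumption: both YOLOv12n metrics refer to the same operating point.
yolov12n_f1 = f1_score(0.916, 0.969)

# Convert each reported latency into images per second.
fps = {name: frames_per_second(ms) for name, ms in INFERENCE_MS.items()}

# The model with the smallest latency is the fastest.
fastest = min(INFERENCE_MS, key=INFERENCE_MS.get)

if __name__ == "__main__":
    print(f"YOLOv12n implied F1: {yolov12n_f1:.3f}")
    for name, value in fps.items():
        print(f"{name}: about {value:.0f} images/s")
    print("Fastest nano model:", fastest)
```

The same F1 calculation is not applied to the YOLOv10 series, because its best precision and best recall come from different configurations and so do not describe a single model.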

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Summary of the image-generation process used to train the YOLOv12 object detection model on synthetic images. The prompt engineering used for realistic image generation in this study is explained in our previous study (sapkota2024yolov10). The lower panel illustrates the architecture of the YOLOv12 object detection model.
  • Figure 2: YOLOv11 Architecture Diagram
  • Figure 3: YOLOv10 Architecture Diagram
  • Figure 4: a) Precision-Recall curve for YOLOv12n (the highest-performing model among YOLOv10, YOLOv11, and YOLOv12), showing superior detection accuracy. b) F1-score versus confidence level for YOLOv12n, indicating optimal threshold settings. c) Detection examples from DALL·E-generated images, highlighting YOLOv12n's effective apple recognition. The synthetic images, produced using LLMs like DALL·E (sapkota2024synthetic, sapkota2024zero), were incorporated into training datasets for YOLOv10 and YOLOv11 to mitigate real-world data limitations. These images simulate diverse challenges such as variable lighting, occlusions, clustered fruit arrangements, and complex orchard backgrounds to enhance model adaptability. By fusing synthetic and authentic data, the models achieve improved generalization, which is critical for deployment in smart agriculture via machine vision sensors. This hybrid approach addresses dataset scarcity and variability, enabling precise apple detection for applications like crop health monitoring, yield prediction, and automated harvesting systems. The integration of synthetic data ensures robustness across unpredictable real-world conditions, bridging the gap between lab performance and field reliability in agricultural technology.
  • Figure 5: Left: Comparison of convolution layers and GFLOPs per YOLOv12 model. Right: Parameter count (in millions) for each model configuration.
  • ...and 1 more figures
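
Panel b) of Figure 4 plots F1-score against the confidence level to indicate a suitable threshold. A minimal sketch of how such a threshold could be chosen is given below. It is not the authors' procedure: it assumes each predicted box has already been marked as a true or false positive by some IoU matching step, the function name `best_f1_threshold` is hypothetical, and the toy detections are made up purely for illustration.

```python
from typing import List, Tuple


def best_f1_threshold(
    detections: List[Tuple[float, bool]], num_ground_truth: int
) -> Tuple[float, float]:
    """Return the (confidence threshold, F1) pair with the highest F1.

    detections: one (confidence, is_true_positive) pair per predicted box.
    num_ground_truth: total number of labelled objects in the evaluation set.
    """
    best_conf, best_f1 = 0.0, 0.0
    # Every distinct confidence value is a candidate threshold.
    for conf in sorted({c for c, _ in detections}):
        kept = [is_tp for c, is_tp in detections if c >= conf]
        tp = sum(kept)                 # true positives that survive the cut
        fp = len(kept) - tp            # false positives that survive the cut
        fn = num_ground_truth - tp     # objects that are no longer detected
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_conf, best_f1 = conf, f1
    return best_conf, best_f1


# Made-up detections for four labelled apples (illustration only).
toy_detections = [
    (0.95, True), (0.90, True), (0.80, True), (0.60, False),
    (0.40, True), (0.30, False), (0.20, False),
]
conf, f1 = best_f1_threshold(toy_detections, num_ground_truth=4)

if __name__ == "__main__":
    print(f"best threshold = {conf:.2f}, F1 = {f1:.3f}")
```

Raising the threshold removes false positives but eventually also discards true detections, which is why the F1 curve peaks at an intermediate confidence rather than at either extreme.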