Assessing the Capability of YOLO- and Transformer-based Object Detectors for Real-time Weed Detection
Alicia Allmendinger, Ahmet Oğuz Saltık, Gerassimos G. Peteinatos, Anthony Stein, Roland Gerhards
TL;DR
This work benchmarks state-of-the-art one-stage YOLO variants (v8–v10) against transformer-based RT-DETR for real-time weed detection in field conditions. By using two labeling schemes (per-species and monocot/dicot grouping) on a 5611-image field dataset, it analyzes precision, recall, and mAP across two hardware setups, revealing that RT-DETR-l achieves the highest precision while YOLOv9s/e offer strong recall and robust mAP, especially in the grouped dataset. Importantly, the smallest YOLO variants deliver sub-100 ms inference times on modern GPUs, supporting deployment on embedded devices, whereas RT-DETR provides favorable false-positive control for spot-spraying scenarios. The study highlights practical deployment considerations, including dataset composition, hardware constraints, and the potential to combine these models with site-specific weed management tools to advance sustainable agriculture and EU Green Deal objectives. Future work includes field integration and exploring synthetic data to broaden background variability and improve generalization.
Abstract
Spot spraying is an efficient and sustainable method for reducing the amount of pesticides, particularly herbicides, used in agricultural fields. To achieve this, it is essential to reliably differentiate between crops and weeds, and even between individual weed species, in situ and under real-time conditions. To assess their suitability for real-time application, current state-of-the-art object detection models are compared. All available models of YOLOv8, YOLOv9, YOLOv10, and RT-DETR are trained and evaluated with images from a real field situation. The images are organized into two distinct datasets: in dataset 1, each plant species is trained as an individual class; in dataset 2, a distinction is made between monocotyledonous weeds, dicotyledonous weeds, and three chosen crops. The results demonstrate that, while all models perform comparably on the evaluated metrics, the YOLOv9 models, particularly YOLOv9s and YOLOv9e, stand out on dataset 2 with strong recall (66.58 % and 72.36 %), mAP50 (73.52 % and 79.86 %), and mAP50-95 (43.82 % and 47.00 %). The RT-DETR models, especially RT-DETR-l, instead excel in precision, reaching 82.44 % on dataset 1 and 81.46 % on dataset 2, which makes them particularly suitable for scenarios where minimizing false positives is critical. The smallest variants of the YOLO models (YOLOv8n, YOLOv9t, and YOLOv10n) achieve substantially faster inference, down to 7.58 ms per frame for dataset 2 on the NVIDIA GeForce RTX 4090 GPU, while maintaining competitive accuracy, highlighting their potential for deployment on the resource-constrained embedded computing devices typically used in production setups.
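To make the reported quantities concrete, the following minimal Python sketch shows how precision and recall follow from detection counts, and how a per-frame inference time translates into a theoretical frame rate. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the study. Only the 7.58 ms latency is the value reported above, and the frame-rate conversion assumes that inference is the sole cost per frame, ignoring image acquisition, pre-processing, and post-processing.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true positives, false positives, false negatives.

    Precision = TP / (TP + FP): the share of predicted boxes that are correct.
    Recall    = TP / (TP + FN): the share of real plants that were found.
    A zero denominator yields 0.0 instead of raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def frames_per_second(latency_ms: float) -> float:
    """Convert a per-frame inference time in milliseconds to a theoretical frame rate."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not results from the study).
    p, r = precision_recall(tp=80, fp=20, fn=40)
    print(f"precision = {p:.2%}, recall = {r:.2%}")

    # 7.58 ms is the fastest per-frame time reported in the abstract.
    print(f"theoretical throughput = {frames_per_second(7.58):.1f} frames/s")
```

In a spot-spraying context, a false positive corresponds to spraying where no target weed is present, while a false negative corresponds to a weed that is missed. That is why a high-precision model and a high-recall model suit different operating priorities.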
