
Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development

Ranjan Sapkota, Achyut Paudel, Manoj Karkee

TL;DR

The paper tackles the cost and logistics of data collection for agricultural instance segmentation by proposing a fully automated, zero-shot workflow: orchard images are generated with an LLM (DALL-E), masks are produced automatically with SAMv2, and the resulting dataset is used to train a YOLO11 model. The automatic masks reach a Dice coefficient of 0.9513 and an IoU of 0.9303 on synthetic data. On 42 real-world Azure Kinect images, YOLO11m-seg achieves a mask precision of 0.902 and a mask mAP@50 of 0.833, indicating robust transfer from synthetic to real environments. The work reduces reliance on physical sensors and manual annotation, enabling scalable, rapid development of agricultural AI for tasks like fruit counting and robotic picking, and sets a baseline for zero-shot instance segmentation in agricultural domains. Overall, the study shows that a synthetic, automatically annotated dataset can train competitive instance segmentation models that remain applicable in real orchards, with potential extensions to other crops and object classes.

Abstract

Deep learning-based instance segmentation for many applications (e.g., agriculture) currently depends on a labor-intensive process: extensive field data collection with sophisticated sensors, followed by careful manual annotation of images. This presents significant logistical and financial challenges to researchers and organizations and slows model development and training. In this study, we present a novel method for deep learning-based instance segmentation of apples in commercial orchards that eliminates the need for labor-intensive field data collection and manual annotation. Using a Large Language Model (LLM), we synthetically generated orchard images and automatically annotated them using the Segment Anything Model (SAM) integrated with a YOLO11 base model. This method significantly reduces reliance on physical sensors and manual data processing, presenting a major advancement in "Agricultural AI". The synthetic, auto-annotated dataset was used to train the YOLO11 model for apple instance segmentation, which was then validated on real orchard images. The automatically generated annotations achieved a Dice Coefficient of 0.9513 and an IoU of 0.9303, validating the accuracy and overlap of the mask annotations. All YOLO11 configurations, trained solely on these synthetic datasets with automated annotations, accurately recognized and delineated apples, highlighting the method's efficacy. Specifically, the YOLO11m-seg configuration achieved a mask precision of 0.902 and a mask mAP@50 of 0.833 on test images collected from a commercial orchard. Additionally, the YOLO11l-seg configuration outperformed other models in validation on 40 LLM-generated images, achieving the highest mask precision and mAP@50 metrics.

Keywords: YOLO, SAM, SAMv2, YOLO11, YOLOv11, Segment Anything, YOLO-SAM
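The Dice Coefficient and IoU quoted above are both overlap measures between a predicted mask and a reference mask. The following is a minimal sketch of how they are commonly computed for a single pair of binary masks, not the authors' evaluation code. The function name `mask_overlap`, the nested-list mask representation, and the convention for two empty masks are assumptions made for illustration.

```python
def mask_overlap(pred, ref):
    """Return (dice, iou) for two equally sized binary masks.

    Masks are nested lists of 0/1 values (an illustrative representation;
    real pipelines would typically use arrays).
    """
    if len(pred) != len(ref) or any(len(p) != len(r) for p, r in zip(pred, ref)):
        raise ValueError("masks must have the same shape")

    inter = pred_area = ref_area = 0
    for p_row, r_row in zip(pred, ref):
        for p, r in zip(p_row, r_row):
            p, r = bool(p), bool(r)
            inter += p and r      # pixel set in both masks
            pred_area += p        # pixel set in the predicted mask
            ref_area += r         # pixel set in the reference mask

    union = pred_area + ref_area - inter
    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0

    dice = 2 * inter / (pred_area + ref_area)
    iou = inter / union
    return dice, iou


# Toy 3x3 example: each mask covers 4 pixels, 3 of which are shared.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
ref = [[0, 0, 1],
       [0, 1, 1],
       [0, 0, 1]]
dice, iou = mask_overlap(pred, ref)
print(f"Dice={dice:.2f} IoU={iou:.2f}")  # Dice=0.75 IoU=0.60
```

For a single mask pair the two measures are tied together by Dice = 2·IoU / (1 + IoU). Values reported over a whole dataset need not satisfy that identity exactly, since it depends on how the scores are aggregated across images.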

Paper Structure

This paper contains 17 sections, 7 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Showing a contrast in data collection methods for instance segmentation in agriculture. On the left, human workers use sophisticated sensors to collect images from orchards and engage in manual labeling, illustrating the traditional, labor-intensive process. On the right, the use of LLMs simplifies this process by generating and automatically annotating realistic images of orchards, showcasing an efficient approach.
  • Figure 2: a) Process diagram illustrating the development of a deep learning model for generation and automated annotation of synthetic apple tree images without the use of physical sensors, field data collection, or manual annotations; b) A sample image of the commercial "Scifresh" apple orchard in Prosser, Washington State, USA, where the developed model was validated, demonstrating the practical application of the synthetic image generation and automated annotation methods; c) A sensing system (including a ground robot) used as the platform for image collection in the orchard; d) Microsoft Azure Kinect DK machine vision camera used for model validation using real-world images, demonstrating the model's applicability in sensor-based, real-world systems.
  • Figure 3: Overview of the automatic annotation process using YOLO11 and SAM models: a) Zero-shot detection using YOLO11 applied to synthetic orchard images, illustrating the model's capability to identify apple instances; and b) Subsequent automatic mask annotation using SAM.
  • Figure 4: YOLO11 model architecture used for detection and segmentation of apples in commercial orchards, trained on the LLM-generated and automatically annotated dataset.
  • Figure 5: Effective mask annotations for training a deep learning-based instance segmentation model: (a) an LLM-generated image of an apple orchard; (b) automatic labeling using SAM after zero-shot detection by the YOLO11 base model; (c) comparative performance metrics (Precision, Recall, F1-Score, Dice Coefficient, and IoU) between LLM-generated datasets and real field images.
  • ...and 6 more figures
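
The automatic annotation step in Figure 3 ends with a mask per apple, which then has to be stored as a training label. As a hedged illustration, the sketch below converts one mask outline (a polygon in pixel coordinates) into a single label line in the Ultralytics YOLO segmentation text format: a class index followed by polygon vertices normalized to the range 0 to 1. That the paper used exactly this format is an assumption, and `polygon_to_yolo_seg` is a hypothetical helper name, not part of any library.

```python
def polygon_to_yolo_seg(class_id, polygon, img_w, img_h, ndigits=6):
    """Format one instance as 'class x1 y1 x2 y2 ...' with normalized coordinates.

    polygon: list of (x, y) vertices in pixels (hypothetical input, e.g. the
    outline of an automatically generated mask).
    """
    if len(polygon) < 3:
        raise ValueError("a polygon needs at least three vertices")
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")

    fields = [str(class_id)]
    for x, y in polygon:
        # Normalize by image size and clamp to [0, 1] in case a vertex
        # falls slightly outside the image.
        nx = min(max(x / img_w, 0.0), 1.0)
        ny = min(max(y / img_h, 0.0), 1.0)
        fields.append(f"{nx:.{ndigits}f}")
        fields.append(f"{ny:.{ndigits}f}")
    return " ".join(fields)


# One triangular outline in a 400x200 image, class 0 (assumed here to be "apple").
line = polygon_to_yolo_seg(0, [(100, 50), (200, 50), (150, 150)], 400, 200)
print(line)  # 0 0.250000 0.250000 0.500000 0.250000 0.375000 0.750000
```

Writing one such line per detected instance into a text file that shares the image's base name yields a label file a YOLO segmentation trainer can read, under the format assumption stated above.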