D4: Text-guided diffusion model-based domain adaptive data augmentation for vineyard shoot detection

Kentaro Hirahara, Chikahito Nakane, Hajime Ebisawa, Tsuyoshi Kuroda, Yohei Iwaki, Tomoyoshi Utsumi, Yuichiro Nomura, Makoto Koike, Hiroshi Mineno

TL;DR

D4 addresses data scarcity and domain diversity in vineyard shoot phenotyping by using a two-stage, text-guided diffusion model to generate domain-adaptive annotated images from unlabeled video frames and a small annotated set. Stage 1 learns broad domain features from edge-based inputs, while Stage 2 learns local, coordinate-based features from annotation-plotted images, enabling domain transfer (e.g., nighttime to daytime). An automatic DreamSim-based selection mechanism and prompt engineering are used to keep the generated samples high in quality, leading to improvements of up to 28.65% in mAP for bounding-box detection and up to 13.73% in AP for keypoint detection. The results demonstrate the method's potential to reduce annotation effort while improving generalization, and they highlight the importance of image quality and prompts for future work on domain adaptation in agriculture and other fields.
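The summary above does not spell out how the DreamSim-based selection works, so the following is only a minimal sketch of one plausible reading: keep a generated image when its average perceptual distance to a few reference images is below a cutoff. The names `select_generated`, `distance_fn`, and `max_distance` are hypothetical, and the distance function is passed in rather than calling any real DreamSim API.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")  # placeholder type; any image representation works


def select_generated(
    candidates: Sequence[Image],
    references: Sequence[Image],
    distance_fn: Callable[[Image, Image], float],
    max_distance: float,
) -> List[Tuple[Image, float]]:
    """Keep candidates whose mean distance to the references is small.

    Assumption: lower distance means "perceptually closer", and a fixed
    cutoff is used. The paper's actual selection rule may differ.
    """
    if not references:
        raise ValueError("at least one reference image is required")

    kept = []
    for cand in candidates:
        # Average the perceptual distance over all reference images.
        mean_dist = sum(distance_fn(cand, ref) for ref in references) / len(references)
        if mean_dist <= max_distance:
            kept.append((cand, mean_dist))

    # Return the closest candidates first.
    kept.sort(key=lambda pair: pair[1])
    return kept
```

In practice `distance_fn` would wrap a perceptual similarity model such as DreamSim; here it is left pluggable so the filtering logic can be checked with any stand-in metric.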

Abstract

In agriculture, plant phenotyping using object detection models is gaining attention. However, collecting the training data necessary to create generic and high-precision models is extremely challenging due to the difficulty of annotation and the diversity of domains. Furthermore, it is difficult to transfer training data across different crops, and although machine learning models effective for specific environments, conditions, or crops have been developed, they cannot be widely applied in actual fields. In this study, we propose a generative data augmentation method (D4) for vineyard shoot detection. D4 uses a pre-trained text-guided diffusion model based on a large number of original images culled from video data collected by unmanned ground vehicles or other means, and a small number of annotated datasets. The proposed method generates new annotated images with background information adapted to the target domain while retaining the annotation information necessary for object detection. In doing so, D4 addresses the lack of training data in agriculture, which stems from the difficulty of annotation and the diversity of domains. We confirmed that this generative data augmentation method improved the mean average precision by up to 28.65% for the BBox detection task and the average precision by up to 13.73% for the keypoint detection task for vineyard shoot detection. D4 is expected to simultaneously solve the cost and domain diversity issues of training data generation in agriculture and to improve the generalization performance of detection models.

Paper Structure

This paper contains 45 sections, 2 equations, 18 figures, and 5 tables.

Figures (18)

  • Figure 1: Challenges in annotation tasks for vineyard cultivation. (a) Detection and counting of very small inflorescences; (b) shape complexity and partial occlusion in shoot detection; (c) influence of background vegetation and similar textures in images taken during daylight.
  • Figure 2: Domain diversity challenges in agriculture. (a) Domain diversity in the agricultural field; (b) changes in appearance characteristics attributed to plant growth.
  • Figure 3: Overview of the dataset used in this study. (a) Data collection environment; (b) frame extraction from video data; (c) annotation definitions for BBox and keypoints.
  • Figure 4: Basic framework and key components of D4.
  • Figure 5: Pre-training the text-guided diffusion model. (a) Stage 1: learning broad features in the proprietary dataset; (b) Stage 2: learning local features in the proprietary dataset.
  • ...and 13 more figures