Few-Shot Adaptation of Grounding DINO for Agricultural Domain

Rajhans Singh, Rafael Bidese Puhl, Kshitiz Dhakal, Sudhir Sornapudi

TL;DR

The paper tackles annotation bottlenecks in agricultural AI by replacing the language encoder in Grounding-DINO with randomly initialized, trainable text embeddings that are learned in a few-shot regime. This BERT-free adaptation enables effective open-set object detection and leaf-level instance segmentation across diverse agricultural and remote-sensing datasets using only a handful of labeled images. Key findings show substantial mAP gains over zero-shot inference and competitive, often superior, performance compared with fully supervised baselines under data-scarce conditions, highlighting practical potential for automated annotation and rapid domain specialization. The approach applies across domains, although highly occluded scenes and extremely diverse insect classes remain challenging, and the paper points to future work on adapters and integration with efficient detection backbones to broaden real-world impact.

Abstract

Deep learning models are transforming agricultural applications by enabling automated phenotyping, monitoring, and yield estimation. However, their effectiveness depends heavily on large amounts of annotated training data, which are labor- and time-intensive to collect. Recent advances in open-set object detection, particularly models such as Grounding-DINO, offer a potential solution by detecting regions of interest from a text prompt. Initial zero-shot experiments revealed challenges in crafting effective text prompts, especially for complex objects like individual leaves and for visually similar classes. To address these limitations, we propose an efficient few-shot adaptation method that simplifies the Grounding-DINO architecture by removing the text encoder module (BERT) and introducing a randomly initialized, trainable text embedding. This method achieves superior performance across multiple agricultural datasets, including plant-weed detection, plant counting, insect identification, fruit counting, and remote sensing tasks. Specifically, under few-shot learning conditions it achieves up to $\sim24\%$ higher mAP than fully fine-tuned YOLO models on agricultural datasets and outperforms previous state-of-the-art methods by $\sim10\%$ in remote sensing. Our method offers a promising solution for automating annotation and accelerating the development of specialized agricultural AI solutions.
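The core idea in the abstract, replacing the BERT text encoder with a randomly initialized trainable embedding, can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names (`init_class_embeddings`, `class_logits`, `sgd_step`), the one-vector-per-class layout, the dot-product scoring between a detector query feature and each class embedding, and the sigmoid binary cross-entropy loss are all hypothetical choices. The abstract does not state which other model parameters are updated during adaptation, so this sketch updates only the embeddings.

```python
import math
import random


def init_class_embeddings(num_classes, dim, seed=0, std=0.02):
    """Randomly initialize one trainable vector per class (stands in for BERT output)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_classes)]


def class_logits(query, embeddings):
    """Score a query feature against every class embedding with a dot product."""
    return [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]


def sgd_step(query, target_class, embeddings, lr=0.5):
    """One gradient step of sigmoid binary cross-entropy w.r.t. the embeddings only.

    The query feature is treated as fixed (assumption: the detector is frozen here).
    """
    logits = class_logits(query, embeddings)
    for c, emb in enumerate(embeddings):
        prob = 1.0 / (1.0 + math.exp(-logits[c]))
        label = 1.0 if c == target_class else 0.0
        grad_scale = prob - label  # derivative of the BCE loss w.r.t. the logit
        for i in range(len(emb)):
            emb[i] -= lr * grad_scale * query[i]


if __name__ == "__main__":
    # Hypothetical two-class setup (e.g. "crop" = 0, "weed" = 1) with 4-dim features.
    embeddings = init_class_embeddings(num_classes=2, dim=4)
    query = [1.0, 0.0, 0.5, -0.5]  # made-up query feature matched to class 1
    for _ in range(50):
        sgd_step(query, target_class=1, embeddings=embeddings)
    print(class_logits(query, embeddings))
```

Because the class vectors are learned directly from a few labeled examples, no text prompt has to be written for hard-to-describe objects such as individual leaves, which is the difficulty the zero-shot experiments ran into.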

Paper Structure

This paper contains 11 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Left: zero-shot vs. our few-shot approach (4-shot, 16-shot) using Grounding-DINO on Wheat Head (GWHD) david2023global and PhenoBench (PB) weyler2024phenobench. Zero-shot fails in cluttered/occluded environments, whereas our few-shot approach performs significantly better. Right: mAP of our few-shot approach increases with more training images.
  • Figure 3: Comparison of zero-shot vs. our few-shot approach using Grounding-DINO on SB20 sb20, Crop-Weed haug2015crop, Grape Detection (WGISD) santos2020grape, and BUP20 smitt2021pathobot. Zero-shot fails in cluttered/occluded environments, struggling to detect instances or distinguish similar classes. Our few-shot approach (4-shot and 16-shot) outperforms zero-shot on all datasets.
  • Figure 4: Qualitative comparison between the zero-shot and few-shot approaches on the DIOR li2020object dataset, demonstrating that our few-shot method achieves significantly better results.
  • Figure 5: Instance segmentation on the PhenoBench dataset weyler2024phenobench using our few-shot Grounding-DINO and SAM2.
  • Figure : Grounding-DINO
  • ...and 2 more figures