Learn from Foundation Model: Fruit Detection Model without Manual Annotation

Yanan Wang, Zhenghao Fei, Ruichen Li, Yibin Ying

TL;DR

This study proposes a framework for training effective, domain-specific small models from foundation models without manual annotation, and demonstrates strong performance across fruit detection tasks (object detection, semantic segmentation, and instance segmentation).

Abstract

Recent breakthroughs in large foundation models have made it possible to transfer knowledge pre-trained on vast datasets to domains with limited data availability. Agriculture is one such domain. This study proposes a framework for training effective, domain-specific small models from foundation models without manual annotation. Our approach begins with SDM (Segmentation-Description-Matching), a stage that leverages two foundation models: SAM2 (Segment Anything in Images and Videos) for segmentation and OpenCLIP (Open Contrastive Language-Image Pretraining) for zero-shot open-vocabulary classification. In the second stage, a novel knowledge distillation mechanism distills compact, edge-deployable models from SDM, improving both inference speed and perception accuracy. The complete method, termed SDM-D (Segmentation-Description-Matching-Distilling), demonstrates strong performance across various fruit detection tasks (object detection, semantic segmentation, and instance segmentation) without manual annotation. It nearly matches the performance of models trained with abundant labels. Notably, SDM-D outperforms open-set detection methods such as Grounding SAM and YOLO-World on all tested fruit detection datasets. Additionally, we introduce MegaFruits, a comprehensive fruit segmentation dataset encompassing over 25,000 images. All code and datasets are publicly available at https://github.com/AgRoboticsResearch/SDM-D.git.
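
The matching step of SDM can be illustrated with a minimal sketch. It assumes each image segment and each text description has already been encoded into a vector (in the paper this is done by OpenCLIP; no encoder is called here). The function names, the toy 3-dimensional embeddings, the example descriptions, and the similarity scale of 100 are all illustrative assumptions and are not taken from the authors' code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def match_regions_to_texts(region_embs, text_embs, labels, scale=100.0):
    """Assign each region the description with the highest cosine similarity.

    `scale` is an assumed temperature applied before the softmax; it is a
    hypothetical default, not a value reported in the paper.
    """
    text_unit = [l2_normalize(t) for t in text_embs]
    results = []
    for region in region_embs:
        r = l2_normalize(region)
        sims = [sum(a * b for a, b in zip(r, t)) for t in text_unit]
        probs = softmax([scale * s for s in sims])
        best = max(range(len(labels)), key=lambda i: probs[i])
        results.append((labels[best], probs[best]))
    return results

# Toy embeddings standing in for real encoder outputs (made up for illustration).
labels = ["a red ripe strawberry", "a green leaf", "background soil"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
matches = match_regions_to_texts(region_embs, text_embs, labels)
```

Here `matches` holds one `(description, probability)` pair per segment. In a label-free pipeline of this kind, such pairs would serve as pseudo-labels for training a smaller model.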

Paper Structure

This paper contains 22 sections, 1 equation, 12 figures, 9 tables, 1 algorithm.

Figures (12)

  • Figure 1: SDM-D can simultaneously detect and segment input images based on the prompts, and enable distillation of knowledge from foundation models to faster, smaller models.
  • Figure 2: Overall framework of SDM-D. (a) Segmentation: based on SAM2, using the sam2_hiera_large weights with a 32 × 32 grid of points and no additional prompts. (b) Mask NMS: proposed to retain the optimal mask. (c) Description and Image Segments Encoding: OpenCLIP encodes the descriptions and image segments. (d) Region-Text Matching: matches image regions with their corresponding textual descriptions. (e) Distilling: transfers knowledge to smaller models that are faster and perform better. SDM consists of sections (a), (b), (c), and (d), which have pink headings; SDM-D is the combination of SDM and section (e).
  • Figure 3: Representative examples of MegaFruits: (a) Object detection task: the label is the smallest rectangular box that encloses each fruit. (b) Semantic segmentation task: all instances in an image are converted into masks and stored together in a single-channel mask image. (c) Instance segmentation task: the label is a set of polygon points around each individual fruit instance.
  • Figure 4: Comparison of object detection results on a strawberry image.
  • Figure 5: Comparison of semantic segmentation results.
  • ...and 7 more figures
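
Two ideas from the captions above, suppressing redundant overlapping masks (Figure 2b) and deriving the smallest enclosing box from a mask (Figure 3a), can be sketched as follows. This is a generic greedy IoU-based mask NMS and is not necessarily the rule the authors use. The function names, the 0.5 threshold, the per-mask scores, and the toy masks are assumptions made for illustration.

```python
def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def greedy_mask_nms(masks, scores, iou_threshold=0.5):
    """Keep the highest-scoring masks, dropping any that overlap a kept mask.

    Generic greedy NMS; the threshold is an assumed value, not one from the paper.
    Returns the indices of the kept masks in descending score order.
    """
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept

def mask_to_box(mask):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) enclosing a mask."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    return (min(cols), min(rows), max(cols), max(rows))

def rect(r0, c0, r1, c1):
    """Helper that builds a rectangular toy mask covering rows r0..r1-1, cols c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}

# Two heavily overlapping masks and one separate mask (toy data).
masks = [rect(0, 0, 4, 4), rect(0, 0, 4, 5), rect(10, 10, 12, 13)]
scores = [0.90, 0.95, 0.80]
kept = greedy_mask_nms(masks, scores)
boxes = [mask_to_box(masks[i]) for i in kept]
```

With this toy input, the first mask is suppressed because its IoU with the higher-scoring second mask is 0.8, so `kept` is `[1, 2]` and `boxes` is `[(0, 0, 4, 3), (10, 10, 12, 11)]`.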