MetaFruit Meets Foundation Models: Leveraging a Comprehensive Multi-Fruit Dataset for Advancing Agricultural Foundation Models

Jiajia Li; Kyle Lammers; Xunyuan Yin; Xiang Yin; Long He; Renfu Lu; Zhaojian Li

TL;DR

This work tackles the generalization gap in fruit detection for robotic harvesting by introducing MetaFruit, a large multi-fruit dataset (4,248 images, 248,015 bounding boxes across five fruit types) collected from diverse U.S. orchards, and FMFruit, an open-set detector built on Vision Foundation Models that leverages Grounding DINO with cross-modal, language-guided components. FMFruit enables zero-shot, few-shot, and cross-class generalization, achieving strong open-set performance and transferring to external fruit datasets, while also supporting referring expression comprehension to follow natural-language prompts. The approach combines $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_{GIOU} + \mathcal{L}_{Cons}$ within a Transformer-based detector, and experiments demonstrate relationships between data diversity, few-shot adaptation, and cross-domain robustness, with practical implications for field-ready robotic harvesting. The dataset and code are released to spur further research in vision-based fruit harvesting, addressing labor and yield challenges in agriculture.
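As a rough illustration of the box-regression part of the loss above, the following sketch computes $\mathcal{L}_1 + \mathcal{L}_{GIOU}$ for a single predicted/target box pair in plain Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `giou` and `box_regression_loss` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` corners with positive area, the terms are summed unweighted as in the formula, and the $\mathcal{L}_{Cons}$ term is omitted because it depends on model outputs not described here.

```python
def giou(box_a, box_b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # GIoU = IoU minus the fraction of the hull not covered by the union.
    return inter / union - (hull - union) / hull


def box_regression_loss(pred, target):
    """Unweighted L1 + GIoU loss for one box pair (contrastive term omitted)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    l_giou = 1.0 - giou(pred, target)
    return l1 + l_giou
```

A perfect prediction gives a loss of zero, while two disjoint boxes are penalized by both terms; unlike plain IoU, the GIoU term still varies with the distance between non-overlapping boxes.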

Abstract

Fruit harvesting imposes a significant labor and financial burden on the industry, highlighting the critical need for advances in robotic harvesting. Machine-vision-based fruit detection is recognized as a crucial component for robustly identifying fruits to guide robotic manipulation. Despite considerable progress in applying deep learning and machine learning techniques to fruit detection, a common shortfall is the inability to swiftly extend the developed models across different orchards and/or fruit species. The limited availability of pertinent data further compounds these challenges. In this work, we introduce MetaFruit, the largest publicly available multi-class fruit dataset, comprising 4,248 images and 248,015 manually labeled instances across diverse U.S. orchards. Furthermore, this study proposes an open-set fruit detection system leveraging advanced Vision Foundation Models (VFMs) that can identify a wide array of fruit types under varying orchard conditions. This system not only adapts from minimal data through few-shot learning but also interprets human instructions for subtle detection tasks. The developed foundation model is comprehensively evaluated using several metrics and outperforms existing state-of-the-art algorithms on both our MetaFruit dataset and other open-source fruit datasets, thereby setting a new benchmark in agricultural technology and robotic harvesting. The MetaFruit dataset and detection framework are open-sourced to foster future research in vision-based fruit harvesting, marking a significant stride toward addressing the urgent needs of the agricultural sector.

Paper Structure (16 sections, 1 equation, 5 figures, 7 tables)

Introduction
Materials and Methods
MetaFruit dataset
VFMs for fruit detection
Few-shot learning
Evaluation metrics
Experimental setups
Results
Few-shot fruit detection performance
Performance of cross-class generalization
Performance on other fruit datasets
Performance of referring expression comprehension (REC)
Discussion
Challenges in real-world implementation
Integration of LLMs
...and 1 more section

Figures (5)

Figure 1: Representative examples from the MetaFruit dataset, covering five fruit classes: (a) apple, (b) orange, (c) lemon, (d) grapefruit, and (e) tangerine.
Figure 2: The framework of the VFM for fruit detection, based on the Grounding DINO model (Liu et al., 2023).
Figure 3: Zero-shot and few-shot fruit detection visualization examples for (a) apple, (b) orange, (c) lemon, (d) grapefruit, and (e) tangerine. The bounding box confidence threshold is set to 0.2 for zero-shot and 0.3 for few-shot detection. Best viewed zoomed in.
Figure 4: Fruit detection visualization results for zero-shot and fine-tuned models on other public datasets, with the bounding box confidence threshold set to 0.3. Best viewed zoomed in.
Figure 5: Visualization examples of referring object detection. The first row displays results for the prompt "apple", while the second row shows responses to a more specific prompt, such as "apple with less occlusion" or "apple without occlusion by branch".
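
The confidence thresholds quoted in the captions for Figures 3 and 4 (0.2 and 0.3) control which predicted boxes are drawn. The snippet below is a small sketch of that filtering step, assuming each detection carries a label, a score in [0, 1], and a box; the `Detection` type, the `filter_by_confidence` helper, the sample values, and the choice of an inclusive `>=` comparison are all illustrative assumptions rather than details from the paper.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    label: str                               # e.g. "apple"
    score: float                             # confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def filter_by_confidence(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep detections whose score meets the threshold (inclusive by assumption)."""
    return [d for d in detections if d.score >= threshold]


# Made-up detections purely for illustration.
sample = [
    Detection("apple", 0.35, (10.0, 10.0, 50.0, 50.0)),
    Detection("apple", 0.25, (60.0, 12.0, 95.0, 48.0)),
    Detection("apple", 0.10, (5.0, 70.0, 30.0, 96.0)),
]

zero_shot_kept = filter_by_confidence(sample, 0.2)  # threshold from the Figure 3 caption
few_shot_kept = filter_by_confidence(sample, 0.3)   # threshold from the Figure 3 caption
```

A lower threshold keeps more candidate boxes, trading additional false positives for fewer missed fruits.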
