MetaFruit Meets Foundation Models: Leveraging a Comprehensive Multi-Fruit Dataset for Advancing Agricultural Foundation Models
Jiajia Li, Kyle Lammers, Xunyuan Yin, Xiang Yin, Long He, Renfu Lu, Zhaojian Li
TL;DR
This work tackles the generalization gap in fruit detection for robotic harvesting with two contributions. The first is MetaFruit, a large multi-fruit dataset (4,248 images, 248,015 bounding boxes across five fruit types) collected from diverse U.S. orchards. The second is FMFruit, an open-set detector built on Vision Foundation Models that adapts Grounding DINO with cross-modal, language-guided components. FMFruit supports zero-shot, few-shot, and cross-class generalization, achieves strong open-set performance, transfers to external fruit datasets, and handles referring expression comprehension so that it can follow natural-language prompts. The Transformer-based detector optimizes the combined loss $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_{GIOU} + \mathcal{L}_{Cons}$, and the experiments examine how data diversity, few-shot adaptation, and cross-domain robustness relate to one another, with practical implications for field-ready robotic harvesting. The dataset and code are released to spur further research in vision-based fruit harvesting and to help address labor and yield challenges in agriculture.
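To make the combined loss above concrete, the following is a minimal sketch of how its three terms could be summed for a single predicted and ground-truth box pair. It is an illustration under stated assumptions, not the authors' implementation: the `(x1, y1, x2, y2)` box format, the unweighted sum, and the function names `l1_loss`, `giou_loss`, and `total_loss` are all hypothetical, and the $\mathcal{L}_{Cons}$ term is passed in as a precomputed scalar because the summary does not define it.

```python
from typing import Sequence

# Assumption: boxes are axis-aligned, (x1, y1, x2, y2), with positive area.
Box = Sequence[float]


def l1_loss(pred: Box, target: Box) -> float:
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def giou_loss(pred: Box, target: Box) -> float:
    """1 - GIoU, where GIoU = IoU - (enclosing area - union) / enclosing area."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = area_p + area_t - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(px2, tx2) - min(px1, tx1)
    enclose_h = max(py2, ty2) - min(py1, ty1)
    enclose = enclose_w * enclose_h

    giou = iou - (enclose - union) / enclose
    return 1.0 - giou


def total_loss(pred: Box, target: Box, cons_term: float) -> float:
    """Unweighted sum L = L_1 + L_GIOU + L_Cons (cons_term is supplied by the caller)."""
    return l1_loss(pred, target) + giou_loss(pred, target) + cons_term
```

For identical boxes both box terms vanish, so `total_loss` reduces to the supplied `cons_term`. For disjoint boxes the GIoU term exceeds 1, which still penalizes the distance between boxes even when their IoU is zero. Practical detectors typically weight each term and average over matched box pairs; those details are outside what the summary states.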
Abstract
Fruit harvesting poses a significant labor and financial burden for the industry, highlighting the critical need for advances in robotic harvesting solutions. Machine vision-based fruit detection has been recognized as a crucial component for robustly identifying fruits to guide robotic manipulation. Despite considerable progress in applying deep learning and machine learning techniques to fruit detection, a common shortfall is the inability to swiftly extend the developed models across different orchards and/or fruit species. The limited availability of pertinent data further compounds these challenges. In this work, we introduce MetaFruit, the largest publicly available multi-class fruit dataset, comprising 4,248 images and 248,015 manually labeled instances across diverse U.S. orchards. Furthermore, this study proposes an innovative open-set fruit detection system that leverages advanced Vision Foundation Models (VFMs) to identify a wide array of fruit types under varying orchard conditions. This system not only demonstrates remarkable adaptability in learning from minimal data through few-shot learning but also shows the ability to interpret human instructions for subtle detection tasks. The developed foundation model is comprehensively evaluated using several metrics and outperforms existing state-of-the-art algorithms on both our MetaFruit dataset and other open-source fruit datasets, thereby setting a new benchmark in the field of agricultural technology and robotic harvesting. The MetaFruit dataset and detection framework are open-sourced to foster future research in vision-based fruit harvesting, marking a significant stride toward addressing the urgent needs of the agricultural sector.
