Single-Shot 6DoF Pose and 3D Size Estimation for Robotic Strawberry Harvesting
Lun Li, Hamidreza Kasaei
TL;DR
The paper addresses robust estimation of the full 6DoF pose and 3D size of strawberries from a single RGB view, with the goal of enabling autonomous robotic harvesting. It introduces Straw6D, a synthetic dataset generated in Ignition Gazebo with domain randomization, and a two-stage, keypoint-based network inspired by YOLO that regresses a 22‑D vector per anchor and recovers pose via PnP, while handling symmetry with a multi-GT loss. The approach achieves strong performance on synthetic data (e.g., 3D IoU AP up to $84.77\%$ at a $0.5$ threshold), demonstrates sim-to-real transfer when fine-tuned on real data, and runs in real time at up to 60 FPS. These findings indicate practical viability for real-world strawberry harvesting, including occluded or densely clustered fruits, and highlight the value of synthetic data with domain randomization for agricultural robotics. Future work aims to extend the approach to dual-arm harvesting so that ripe fruit obscured by unripe berries can be retrieved.
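As an illustration of how a multi-GT loss can handle symmetry, the sketch below scores a prediction against several equivalent ground truths and keeps the smallest error. It is not the paper's formulation: the choice of the z-axis as the symmetry axis, the use of 3D keypoints in the object frame, the L1 distance, the eight rotation steps, and all function names are assumptions made for this example.

```python
import math

def rotate_about_z(point, angle):
    """Rotate a 3D point about the z-axis (assumed here to be the symmetry axis)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def symmetric_candidates(keypoints, n_steps):
    """Build n_steps equivalent ground-truth keypoint sets by rotating about z."""
    return [
        [rotate_about_z(p, 2.0 * math.pi * k / n_steps) for p in keypoints]
        for k in range(n_steps)
    ]

def l1_distance(pred, target):
    """Mean per-keypoint L1 distance between two keypoint lists."""
    total = sum(abs(a - b) for p, t in zip(pred, target) for a, b in zip(p, t))
    return total / len(pred)

def multi_gt_loss(pred, gt_keypoints, n_steps=8):
    """Take the minimum loss over all symmetric ground-truth candidates."""
    candidates = symmetric_candidates(gt_keypoints, n_steps)
    return min(l1_distance(pred, cand) for cand in candidates)

if __name__ == "__main__":
    gt = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    # A prediction that equals the ground truth rotated by 90 degrees about z.
    pred = [(0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
    print("single-GT loss:", l1_distance(pred, gt))
    print("multi-GT loss:", multi_gt_loss(pred, gt))
```

With a single ground truth, the rotated prediction is penalized even though it describes an equivalent pose of a symmetric object. Taking the minimum over candidates removes that penalty.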
Abstract
In this study, we introduce a deep-learning approach for estimating both the 6DoF pose and 3D size of strawberries, with the aim of improving robotic harvesting efficiency. Our model was trained on a synthetic strawberry dataset generated automatically within the Ignition Gazebo simulator, with a specific focus on the inherent symmetry of strawberries. By leveraging domain randomization, the model achieved an 84.77\% average precision (AP) for 3D Intersection over Union (IoU) on the simulated dataset. Evaluations on real-world datasets indicated that the model is viable for real-world strawberry harvesting scenarios, even though it was trained on synthetic data. The model also handled occlusion robustly, maintaining accurate detection even when strawberries were obscured by other strawberries or foliage. Additionally, the model reached inference speeds of up to 60 frames per second (FPS).
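To make the 3D IoU metric concrete, the following minimal sketch computes the IoU of two boxes under the simplifying assumption that both are axis-aligned, which is not how oriented boxes from a 6DoF pose would be compared in general. The function name, the centre-plus-size box representation, and the example dimensions in metres are hypothetical.

```python
def iou_3d_axis_aligned(center_a, size_a, center_b, size_b):
    """3D IoU of two axis-aligned boxes given as (centre, size) triples."""
    intersection = 1.0
    for ca, sa, cb, sb in zip(center_a, size_a, center_b, size_b):
        lo = max(ca - sa / 2.0, cb - sb / 2.0)
        hi = min(ca + sa / 2.0, cb + sb / 2.0)
        if hi <= lo:
            return 0.0  # no overlap along this axis, so no overlap at all
        intersection *= hi - lo
    volume_a = size_a[0] * size_a[1] * size_a[2]
    volume_b = size_b[0] * size_b[1] * size_b[2]
    return intersection / (volume_a + volume_b - intersection)

if __name__ == "__main__":
    # Hypothetical predicted and ground-truth boxes, sizes in metres.
    predicted = ((0.000, 0.0, 0.0), (0.03, 0.03, 0.04))
    ground_truth = ((0.005, 0.0, 0.0), (0.03, 0.03, 0.04))
    iou = iou_3d_axis_aligned(*predicted, *ground_truth)
    # A detection counts as correct at the 0.5 threshold when IoU >= 0.5.
    print(f"IoU = {iou:.3f}, match at 0.5: {iou >= 0.5}")
```

Average precision is then obtained by ranking detections by confidence and counting those whose IoU with a ground-truth box meets the chosen threshold.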
