Single-Shot 6DoF Pose and 3D Size Estimation for Robotic Strawberry Harvesting

Lun Li, Hamidreza Kasaei

TL;DR

The paper tackles robust estimation of the full 6DoF pose and 3D size of strawberries from a single RGB view to enable autonomous robotic harvesting. It introduces Straw6D, a synthetic dataset generated in Ignition Gazebo with domain randomization, and a two-stage, keypoint-based network inspired by YOLO that regresses a 22-D vector per anchor and recovers pose via PnP, while handling symmetry with a multi-GT loss. The approach achieves strong synthetic performance (e.g., 3D IoU AP up to 84.77% at a 0.5 threshold) and demonstrates sim-to-real transfer when fine-tuned on real data, with real-time inference at up to 60 FPS. These findings show practical viability for real-world strawberry harvesting, including occluded or densely clustered fruits, and highlight the value of synthetic data with domain randomization for agricultural robotics. Future work envisions extending the system to dual-arm harvesting to retrieve ripe fruit obscured by unripe berries.

Abstract

In this study, we introduce a deep-learning approach for determining both the 6DoF pose and 3D size of strawberries, aiming to significantly augment robotic harvesting efficiency. Our model was trained on a synthetic strawberry dataset, which is automatically generated within the Ignition Gazebo simulator, with a specific focus on the inherent symmetry exhibited by strawberries. By leveraging domain randomization techniques, the model demonstrated exceptional performance, achieving an 84.77% average precision (AP) of 3D Intersection over Union (IoU) scores on the simulated dataset. Empirical evaluations, conducted by testing our model on real-world datasets, underscored the model's viability for real-world strawberry harvesting scenarios, even though its training was based on synthetic data. The model also exhibited robust occlusion handling abilities, maintaining accurate detection capabilities even when strawberries were obscured by other strawberries or foliage. Additionally, the model showcased remarkably swift inference speeds, reaching up to 60 frames per second (FPS).

Paper Structure

This paper contains 11 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: We proposed a 6DoF pose and 3D size estimator capable of estimating the size and pose of all strawberry instances in a single view simultaneously. Notably, it could identify previously unseen strawberries without relying on any specific strawberry CAD models. To train our model, we created a synthetic dataset, named Straw6D, tailored for 6DoF pose and 3D size estimation of strawberries (top row). Furthermore, our model demonstrated sim-to-real transferability to real-world images (bottom row).
  • Figure 2: Overview of our two-stage 6DoF pose and 3D size estimation method. Given an image, the neural network divides it into an S×S grid. Each grid cell predicts a 22-dimensional vector encompassing all information about the strawberry's 3D bounding box. Once the vector is decoded, the 3D size can be read off directly, and the 6DoF pose can be computed using the PnP algorithm together with the camera's intrinsic parameters.
  • Figure 3: Explanation of how we generate a diverse 6DoF pose and 3D size dataset for strawberries in the Ignition Gazebo simulator: (a) Each generated strawberry is random, encompassing variations in plant shape, the distribution of strawberries on the plant, and the size, ripeness, and pose of the strawberries. (b) For each batch of strawberries, different lighting conditions are set to mimic natural lighting variations. (c) The camera randomly selects angles from a reasonable range to capture images. (d) For a given camera angle, the distance between the camera and the strawberries is also adjusted. The entire process is automated and continues until the preset sample size is reached. The final dataset, named Straw6D, includes RGB images, depth images, 3D bounding box annotations, strawberry instance segmentation masks and point clouds.
  • Figure 4: Inferences made on the synthetic test dataset. The orange 3D bounding boxes represent our predictions of the strawberries, while the white ones denote the ground truth.
  • Figure 5: Inferences made on the real-world strawberry dataset of Pérez-Borrero et al. (2020). The orange 3D bounding boxes represent our predictions of the strawberries.
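
The second stage described in the Figure 2 caption (recovering the 6DoF pose from a decoded 3D bounding box via PnP and the camera intrinsics) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the helper names (box_corners, project, pnp_dlt), the corner ordering, the camera matrix, and the choice of a plain linear DLT solver are all hypothetical. The source does not say which PnP solver is used or how the 22-D vector is laid out, so the sketch starts from already-decoded inputs: a 3D size and the eight projected box corners in pixels.

```python
import numpy as np

def box_corners(size):
    """Eight corners of an origin-centred box with edge lengths (w, h, d).
    The corner ordering here is an arbitrary choice for this sketch."""
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    return 0.5 * signs * np.asarray(size, dtype=float)

def project(points, R, t, K):
    """Pinhole projection of object-frame points to pixel coordinates."""
    cam = points @ R.T + t          # object frame -> camera frame
    uv = cam @ K.T                  # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]   # perspective divide

def pnp_dlt(points_3d, pixels, K):
    """Linear (DLT) PnP: needs at least 6 non-coplanar 3D-2D correspondences."""
    ones = np.ones((len(pixels), 1))
    # Remove the intrinsics so the unknown matrix is proportional to [R | t].
    xn = np.hstack([pixels, ones]) @ np.linalg.inv(K).T
    rows = []
    for X, (x, y, _) in zip(points_3d, xn):
        Xh = np.append(X, 1.0)
        rows.append(np.concatenate([Xh, np.zeros(4), -x * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -y * Xh]))
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.array(rows))
    P = Vt[-1].reshape(3, 4)
    # Project the left 3x3 block onto the nearest orthogonal matrix and
    # recover the unknown scale from its singular values.
    U, S, Vr = np.linalg.svd(P[:, :3])
    R = U @ Vr
    t = P[:, 3] / S.mean()
    if np.linalg.det(R) < 0:        # resolve the sign ambiguity of the null vector
        R, t = -R, -t
    return R, t

# Synthetic check with made-up values: a 3 x 4 x 3 cm box about 40 cm from the camera.
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
a, b = 0.3, -0.5
Rz = np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])
Rx = np.array([[1.0, 0.0, 0.0], [0.0, np.cos(b), -np.sin(b)], [0.0, np.sin(b), np.cos(b)]])
R_true, t_true = Rz @ Rx, np.array([0.02, -0.01, 0.40])

corners = box_corners((0.03, 0.04, 0.03))
pixels = project(corners, R_true, t_true, K)
R_est, t_est = pnp_dlt(corners, pixels, K)
```

With noise-free keypoints the linear solve recovers the pose used to generate them. With noisy network predictions, an iterative solver such as OpenCV's cv2.solvePnP would typically be preferred over this bare DLT.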