
Mind the Shape Gap: A Benchmark and Baseline for Deformation-Aware 6D Pose Estimation of Agricultural Produce

Nikolas Chatzis, Angeliki Tsinouka, Katerina Papadimitriou, Niki Efthymiou, Marios Glytsos, George Retsinas, Paris Oikonomou, Gerasimos Potamianos, Petros Maragos, Panagiotis Paraskevas Filntisis

Abstract

Accurate 6D pose estimation for robotic harvesting is fundamentally hindered by the biological deformability and high intra-class shape variability of agricultural produce. Instance-level methods fail in this setting, as obtaining exact 3D models for every unique piece of produce is practically infeasible, while category-level approaches that rely on a fixed template suffer significant accuracy degradation when the prior deviates from the true instance geometry. To address this lack of robustness to deformation, we introduce PEAR (Pose and dEformation of Agricultural pRoduce), the first benchmark providing joint 6D pose and per-instance 3D deformation ground truth across 8 produce categories, acquired via a robotic manipulator for high annotation accuracy. Using PEAR, we show that state-of-the-art methods suffer up to 6x performance degradation when faced with the inherent geometric deviations of real-world produce. Motivated by this finding, we propose SEED (Simultaneous Estimation of posE and Deformation), a unified RGB-only framework that jointly predicts 6D pose and explicit lattice deformations from a single image across multiple produce categories. Trained entirely on synthetic data with generative texture augmentation applied at the UV level, SEED outperforms MegaPose on 6 out of 8 categories under identical RGB-only conditions, demonstrating that explicit shape modeling is a critical step toward reliable pose estimation in agricultural robotics.
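The abstract describes SEED's shape output as an explicit lattice deformation of a category base mesh, but the exact parameterization is not spelled out here. As a rough illustration only, the sketch below warps base-mesh vertices by trilinearly interpolating predicted control-point offsets on a regular lattice around the mesh; the lattice resolution, trilinear interpolation, function name, and NumPy interface are all assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def deform_with_lattice(vertices, offsets, bbox_min, bbox_max):
    """Warp mesh vertices with a regular deformation lattice (illustrative).

    vertices : (N, 3) base-mesh vertices.
    offsets  : (Lx, Ly, Lz, 3) predicted displacement for each control point of a
               lattice spanning the mesh bounding box (Lx, Ly, Lz >= 2).
    bbox_min, bbox_max : (3,) corners of the lattice bounding box.
    Returns the deformed (N, 3) vertices.
    """
    vertices = np.asarray(vertices, dtype=float)
    res = np.array(offsets.shape[:3])                        # lattice resolution
    # Map vertices into lattice coordinates in [0, res - 1].
    t = (vertices - bbox_min) / (bbox_max - bbox_min)
    t = np.clip(t, 0.0, 1.0) * (res - 1)
    i0 = np.minimum(np.floor(t).astype(int), res - 2)        # lower cell corner
    f = t - i0                                                # fractional position

    disp = np.zeros_like(vertices)
    # Trilinear blend of the 8 control-point offsets surrounding each vertex.
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = f[:, 0] if dx else 1.0 - f[:, 0]
                wy = f[:, 1] if dy else 1.0 - f[:, 1]
                wz = f[:, 2] if dz else 1.0 - f[:, 2]
                w = wx * wy * wz
                disp += w[:, None] * offsets[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
    return vertices + disp

# Example: shift a unit-cube point cloud along +x (purely illustrative numbers).
verts = np.random.rand(1000, 3)
offs = np.zeros((4, 4, 4, 3))
offs[..., 0] = 0.05
warped = deform_with_lattice(verts, offs, np.zeros(3), np.ones(3))
```

The appeal of such a lattice is compactness: a small grid of predicted offsets (here 4×4×4×3 values) can reshape an arbitrarily dense mesh, which makes it a convenient deformation output for a regression head.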


Paper Structure

This paper contains 29 sections, 8 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 2: Sample annotated frames from the PEAR dataset.
  • Figure 3: Effect of template shape mismatch on pose estimation. Each row shows the results of foundation models for a selected frame of the PEAR dataset. From left to right: FoundationPose predictions using the ground-truth mesh, predictions using a category-level base mesh, and a reference view for comparison; the same three visualizations are shown for MegaPose in the last three columns. The results highlight the sensitivity of pose estimators to geometric discrepancies between the template mesh and the true object shape.
  • Figure 4: PEAR benchmark acquisition framework. (a) An Orbbec Femto Bolt RGB-D camera is rigidly mounted to a Franka Panda manipulator, observing the produce from a fixed distance. (b) To ensure comprehensive coverage, viewpoints are uniformly sampled across three concentric spheres along an $80^\circ$ azimuth arc facing the robot, always oriented directly toward the scene center.
  • Figure 5: Left: input RGB images. Middle: initial pose predictions from FoundationPose given the depth image and object mask. Right: pose estimates after our joint multi-view optimization. The first row shows the object in an unoccluded setting, while the second row shows the same object placement under occlusion.
  • Figure 6: Overview of the SEED model. A cropped RGB detection is passed through a DINOv2 backbone with LoRA adaptation, producing features that feed three prediction heads for rotation, translation, and lattice deformation. The predicted deformation warps the category base mesh to match the observed instance geometry. (A hedged code sketch of this pipeline follows the figure list.)
  • ...and 1 more figure
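Reading the Figure 6 caption literally, the forward pass is a single backbone feature feeding three regression heads. The skeleton below is a minimal sketch under assumed details: the class name SEEDSketch is hypothetical, the DINOv2 ViT-S/14 backbone loaded from torch.hub, the 4×4×4 deformation lattice, the 6D rotation representation, and the two-layer heads are all guesses, and the LoRA adapters and training losses used in the paper are omitted entirely.

```python
import torch
import torch.nn as nn

class SEEDSketch(nn.Module):
    """Rough skeleton of the pipeline in Figure 6; layer sizes, head designs,
    and the rotation parameterization are assumptions, not the paper's choices."""

    def __init__(self, feat_dim=384, lattice_res=4):
        super().__init__()
        # DINOv2 ViT-S/14 backbone (384-dim image feature). The paper adapts it
        # with LoRA, which is omitted here; a library such as peft could inject
        # the adapters into the attention projections.
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        self.lattice_res = lattice_res

        def head(out_dim):
            return nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

        self.rot_head = head(6)                          # e.g. a 6D rotation representation
        self.trans_head = head(3)                        # translation
        self.deform_head = head(lattice_res ** 3 * 3)    # per-control-point offsets

    def forward(self, crop):
        # crop: (B, 3, H, W) RGB detection crop, H and W multiples of 14.
        feat = self.backbone(crop)                       # (B, feat_dim) image feature
        rot = self.rot_head(feat)
        trans = self.trans_head(feat)
        r = self.lattice_res
        offsets = self.deform_head(feat).view(-1, r, r, r, 3)
        return rot, trans, offsets                       # offsets warp the base mesh
```

The predicted offsets would then be applied to the category base mesh, for instance with a lattice-warp routine like the one sketched after the abstract, before composing the result with the predicted rotation and translation.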