Generating Future Observations to Estimate Grasp Success in Cluttered Environments
Daniel Fernandes Gomes, Wenxuan Mou, Paolo Paoletti, Shan Luo
TL;DR
This work tackles grasp success estimation in cluttered environments by contrasting a model-free end-to-end predictor with a model-based pipeline that forecasts a future observation $\hat{I}_d$ prior to grasping and uses it to determine success. Using a robot-arm bin-picking setup, the authors collect $24{,}364$ grasps with autonomous annotations to evaluate methods on pre-grasp, during-grasp, and post-grasp data. The model-free approach achieves up to $72\%$ accuracy, while the model-based approach with hallucinated future views reaches $82\%$ on validation, a substantial improvement that highlights the value of future-view prediction for manipulation tasks. These results suggest that self-supervised predictive models can improve data efficiency and robustness in grasping under clutter, with future work aiming to enhance generated observations via diffusion/transformer techniques and to integrate optical tactile sensing for Sim2Real transfer.
Abstract
End-to-end self-supervised models have been proposed for estimating the success of candidate grasps, and video predictive models have been proposed for generating future observations. However, no prior work has studied these two strategies side by side for grasp success estimation. We investigate and compare a model-free approach, which estimates the success of a candidate grasp directly, against a model-based alternative that exploits a self-supervised learnt predictive model to generate a future observation of the gripper about to grasp an object. Our experiments demonstrate that while the end-to-end model-free model achieves a best accuracy of 72%, the proposed model-based pipeline yields a significantly higher accuracy of 82%.
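
To make the contrast between the two strategies concrete, the following is a minimal sketch in Python, not the authors' implementation. The model-free route maps the pre-grasp observation and the candidate grasp straight to a success score, whereas the model-based route first generates a future observation and then scores that. All names (`model_free_success`, `model_based_success`, `accuracy`, and the `toy_*` stand-ins) are hypothetical, observations are represented as plain lists of floats, and the toy functions merely take the place of learnt networks.

```python
from typing import Callable, Sequence

# Hypothetical placeholders: a real system would use image tensors and
# learnt networks. Plain float sequences keep the sketch self-contained.
Observation = Sequence[float]
Grasp = Sequence[float]


def model_free_success(
    i_before: Observation,
    grasp: Grasp,
    estimator: Callable[[Observation, Grasp], float],
) -> float:
    """Model-free: score the candidate grasp directly from the pre-grasp view."""
    return estimator(i_before, grasp)


def model_based_success(
    i_before: Observation,
    grasp: Grasp,
    predictor: Callable[[Observation, Grasp], Observation],
    classifier: Callable[[Observation], float],
) -> float:
    """Model-based: generate a future observation, then score that observation."""
    i_d_hat = predictor(i_before, grasp)  # generated (not real) future view
    return classifier(i_d_hat)


def accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5) -> float:
    """Fraction of grasps whose thresholded score matches the success label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


# Toy stand-ins (assumptions for illustration only, not learnt models).
def toy_estimator(i_before: Observation, grasp: Grasp) -> float:
    return min(1.0, max(0.0, sum(i_before) / len(i_before)))


def toy_predictor(i_before: Observation, grasp: Grasp) -> Observation:
    return [pixel * grasp[0] for pixel in i_before]


def toy_classifier(i_d_hat: Observation) -> float:
    return min(1.0, max(0.0, max(i_d_hat)))
```

In this sketch the only structural difference is the intermediate `i_d_hat`: the classifier in the model-based route never sees the pre-grasp observation, only the generated view of the gripper about to grasp.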
