Stereo Hand-Object Reconstruction for Human-to-Robot Handover
Yik Lung Pang, Alessio Xompero, Changjae Oh, Andrea Cavallaro
TL;DR
The paper tackles robust hand-object reconstruction for human-to-robot handovers using wide-baseline stereo RGB cameras, avoiding the failure of depth sensors on transparent objects. It introduces StereoHO, which learns discrete 3D shape embeddings for hands and objects from synthetic data and fuses the predictions from the left and right views as a product, $P = P_L P_R$, to produce a single TSDF-based surface that is consistent across views. Key contributions include dual codebooks trained with a vector-quantized autoencoder on the ObMan dataset, an image-to-shape encoder trained with a weighted cross-entropy loss, and a complete handover pipeline validated on diverse objects including filled containers. The approach achieves lower object Chamfer distances than RGB-based baselines, supports handovers with wide-baseline RGB cameras, and generalizes beyond container shapes; future work aims to speed up reconstruction and further improve its quality.
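The fusion rule $P = P_L P_R$ can be illustrated with a minimal NumPy sketch. It assumes that each view yields, for every cell of a shared 3D grid, a categorical distribution over K codebook entries, and that both grids are already expressed in a common frame. The function name `fuse_stereo_probs`, the renormalization step, and the toy numbers are illustrative choices, not details taken from the paper.

```python
import numpy as np


def fuse_stereo_probs(p_left, p_right, eps=1e-12):
    """Fuse two per-cell categorical distributions by elementwise product.

    p_left, p_right: arrays of shape (..., K) whose last axis sums to 1.
    Returns the renormalized product and the most likely index per cell.
    Assumption: both inputs refer to the same cells in a common frame.
    """
    p_left = np.asarray(p_left, dtype=np.float64)
    p_right = np.asarray(p_right, dtype=np.float64)
    if p_left.shape != p_right.shape:
        raise ValueError("left and right distributions must have the same shape")

    fused = p_left * p_right                      # P = P_L * P_R
    norm = fused.sum(axis=-1, keepdims=True)
    fused = fused / np.maximum(norm, eps)         # renormalize (does not change the argmax)
    return fused, fused.argmax(axis=-1)


# Toy example: one grid cell, K = 3 codebook entries.
p_l = np.array([[0.6, 0.3, 0.1]])   # left view alone prefers entry 0
p_r = np.array([[0.2, 0.7, 0.1]])   # right view alone prefers entry 1
fused, index = fuse_stereo_probs(p_l, p_r)
print(fused, index)                 # the product favors entry 1
```

Under a product rule, an entry receives high fused probability only if neither view considers it unlikely, so a confident view can override an uncertain one.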
Abstract
Jointly estimating hand and object shape facilitates the grasping task in human-to-robot handovers. However, relying on hand-crafted prior knowledge about the geometric structure of the object fails to generalise to unseen objects, and depth sensors fail to detect transparent objects such as drinking glasses. In this work, we propose a stereo-based method for hand-object reconstruction that combines single-view reconstructions probabilistically to form a coherent stereo reconstruction. We learn 3D shape priors from a large synthetic hand-object dataset to ensure that our method is generalisable, and use RGB inputs to better capture transparent objects. We show that our method reduces the object Chamfer distance compared to existing RGB-based hand-object reconstruction methods in both single-view and stereo settings. We process the reconstructed hand-object shape with a projection-based outlier removal step and use the output to guide a human-to-robot handover pipeline with wide-baseline stereo RGB cameras. Our hand-object reconstruction enables a robot to successfully receive a diverse range of household objects from the human.
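The Chamfer distance between a reconstructed and a reference point set can be sketched as follows. Conventions differ between papers (squared or unsquared distances, sums or means, unit scaling), so this shows one common symmetric variant based on mean squared nearest-neighbour distances and is not necessarily the exact metric reported here. `chamfer_distance` is a hypothetical helper name, and the points are toy values.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses mean squared nearest-neighbour distance in both directions.
    Brute force: builds an (N, M) distance matrix, so it suits small sets only.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Pairwise squared Euclidean distances via broadcasting, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    a_to_b = d2.min(axis=1).mean()   # each point in a to its nearest point in b
    b_to_a = d2.min(axis=0).mean()   # each point in b to its nearest point in a
    return a_to_b + b_to_a


# Toy example with two points per set.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ref = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
print(chamfer_distance(pred, ref))   # 0.5 + 0.5 = 1.0
```

For dense surface samples, a spatial index such as a k-d tree would avoid the quadratic memory cost of the full distance matrix.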
