Stereo Hand-Object Reconstruction for Human-to-Robot Handover

Yik Lung Pang, Alessio Xompero, Changjae Oh, Andrea Cavallaro

TL;DR

The paper tackles robust hand-object reconstruction for human-to-robot handovers using wide-baseline stereo RGB to overcome depth sensing limitations on transparent objects. It introduces StereoHO, which learns discrete 3D shape embeddings for hands and objects from synthetic data and fuses two stereo views via $P = P_L P_R$ to produce a coherent TSDF-based surface with multi-view consistency. Key contributions include dual codebooks trained with a vector-quantized autoencoder on the ObMan dataset, an image-to-shape encoder trained with a weighted cross-entropy loss, and a complete handover pipeline validated on diverse objects including filled containers. The approach yields improved object Chamfer distances over RGB-based baselines, demonstrates robust handovers with wide-baseline RGB cameras, and generalizes beyond container shapes, indicating strong practical impact for human-robot collaboration; future work aims to speed up reconstruction and further enhance quality.
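
The fusion rule $P = P_L P_R$ can be illustrated with a small example. This is a minimal sketch, assuming the product is taken element-wise over codebook entries at each latent grid cell and then renormalised. The renormalisation, the fallback for fully disagreeing views, the function names and the toy sizes are all illustrative assumptions, not details from the paper.

```python
def fuse_views(p_left, p_right, eps=1e-12):
    """Fuse two per-view categorical distributions over a shape codebook.

    p_left, p_right: lists of rows; each row is a distribution over
    codebook entries for one latent grid cell (hypothetical layout).
    """
    fused = []
    for row_l, row_r in zip(p_left, p_right):
        # Element-wise product: an entry scores high only if both views agree.
        prod = [a * b for a, b in zip(row_l, row_r)]
        total = sum(prod)
        if total < eps:
            # Assumption: if the views share no support, fall back to the mean.
            prod = [(a + b) / 2 for a, b in zip(row_l, row_r)]
            total = sum(prod)
        fused.append([p / total for p in prod])
    return fused


def argmax_codes(p):
    """Pick the most likely codebook index for every grid cell."""
    return [max(range(len(row)), key=row.__getitem__) for row in p]


# Toy example: 2 grid cells, codebook with 3 entries.
p_l = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
p_r = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]
fused = fuse_views(p_l, p_r)
codes = argmax_codes(fused)
```

In the second cell the left view alone would select entry 1, while the fused distribution selects entry 2 because the right view is more confident there.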

Abstract

Jointly estimating hand and object shape facilitates the grasping task in human-to-robot handovers. However, relying on hand-crafted prior knowledge about the geometric structure of the object fails when generalising to unseen objects, and depth sensors fail to detect transparent objects such as drinking glasses. In this work, we propose a stereo-based method for hand-object reconstruction that combines single-view reconstructions probabilistically to form a coherent stereo reconstruction. We learn 3D shape priors from a large synthetic hand-object dataset to ensure that our method is generalisable, and use RGB inputs to better capture transparent objects. We show that our method reduces the object Chamfer distance compared to existing RGB-based hand-object reconstruction methods in single-view and stereo settings. We process the reconstructed hand-object shape with a projection-based outlier removal step and use the output to guide a human-to-robot handover pipeline with wide-baseline stereo RGB cameras. Our hand-object reconstruction enables a robot to successfully receive a diverse range of household objects from the human.
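
For readers unfamiliar with the Chamfer distance used for evaluation, the following is a minimal sketch of one common symmetric form: the mean squared nearest-neighbour distance, summed over both directions. Definitions vary (squared or not, sum or mean, scaling), and the exact variant used in the paper is not stated in this summary, so treat the formula and the function name as assumptions.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (brute force)."""

    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # Mean over src of the squared distance to the closest point in dst.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)


# Toy example: a predicted and a ground-truth pointcloud with 2 points each.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(pred, gt)
```

The brute-force search is quadratic in the number of points; for dense pointclouds a spatial index such as a k-d tree would normally replace the inner `min`.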

Paper Structure

This paper contains 15 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: We reconstruct the hand-object pointcloud from stereo RGB input for human-to-robot handover. (1) A safe grasp is selected for the handover and the robot moves in to grasp the object. (2) The object is delivered to a target location. (3) The robot returns to its home position.
  • Figure 2: Limitations of existing human-to-robot handover approaches. (a) Depth-based sensing fails on transparent objects. (b) Relying on shape priors does not generalise.
  • Figure 3: Our proposed method, StereoHO, for hand-object reconstruction with two cropped images from a wide-baseline stereo camera. StereoHO predicts the probability distributions, $P_{L} = [P_{HL}, P_{OL}]$ and $P_{R} = [P_{HR}, P_{OR}]$, over the shape codebooks for each view and combines them into a coherent probability distribution $P = [P_{H}, P_{O}]$. The trained SDF decoder transforms $P$ into the hand-object T-SDF. Surface points sampled from the T-SDF (pointcloud $\mathcal{P}$) are projected into each view using the predicted camera projection parameters, and outliers are removed using the segmentation masks to obtain the final pointcloud $\mathcal{P}'$. KEYS -- $K$: intrinsic calibration parameters, $L$: left view, $R$: right view, $H$: hand, $O$: object.
  • Figure 4: Our proposed pipeline for human-to-robot handover. For each frame, hand-object detection extracts bounding boxes $B$. The boxes are used to crop the input images around the hand, and then to segment hand-object masks $M$ and estimate wrist poses $T_{H}$. StereoHO uses these outputs along with the image crops to reconstruct the hand-object shape $\mathcal{P}'$. We estimate the grasp $g$ on the reconstructed shape and transform it into the robot coordinate system using the wrist poses $T_{HB}$.
  • Figure 5: Synthetic object types used for training StereoHO (top row), and objects for testing hand-object reconstruction (middle row). For handovers (bottom row): 3 cups and 1 glass, empty or filled with rice, from a benchmarking protocol [sanchez2020benchmark] (left) and 8 household objects (right) to assess generalisation.
  • ...and 4 more figures
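
The projection-based outlier removal described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions: a plain pinhole model with translation-only extrinsics stands in for the predicted camera projection parameters, and a point is kept only if it lands inside the mask in both views. The camera dictionary layout, the function names and all numeric values are hypothetical.

```python
def project(point, cam):
    """Pinhole projection to integer pixel coordinates (u, v).

    Assumption: the camera pose is a pure translation cam["t"];
    rotation is omitted to keep the sketch short.
    """
    x = point[0] + cam["t"][0]
    y = point[1] + cam["t"][1]
    z = point[2] + cam["t"][2]
    if z <= 0:
        return None  # behind the camera
    u = cam["fx"] * x / z + cam["cx"]
    v = cam["fy"] * y / z + cam["cy"]
    return int(round(u)), int(round(v))


def inside_mask(pixel, mask):
    """True if the pixel is within the image and the mask is set there."""
    if pixel is None:
        return False
    u, v = pixel
    return 0 <= v < len(mask) and 0 <= u < len(mask[0]) and bool(mask[v][u])


def remove_outliers(points, views):
    """Keep points whose projection falls inside the mask of every view."""
    return [
        p for p in points
        if all(inside_mask(project(p, cam), cam["mask"]) for cam in views)
    ]


# Toy 5x5 hand-object mask: foreground is the central 3x3 block.
mask = [[1 <= r <= 3 and 1 <= c <= 3 for c in range(5)] for r in range(5)]
left = {"fx": 1.0, "fy": 1.0, "cx": 2.0, "cy": 2.0, "t": (0.0, 0.0, 0.0), "mask": mask}
right = {"fx": 1.0, "fy": 1.0, "cx": 2.0, "cy": 2.0, "t": (-1.0, 0.0, 0.0), "mask": mask}

points = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0), (0.0, 0.0, -1.0), (1.0, 1.0, 1.0)]
kept = remove_outliers(points, [left, right])
```

Here the second point projects outside the left mask and the third lies behind both cameras, so only the first and last points survive.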