Sparse multi-view hand-object reconstruction for unseen environments

Yik Lung Pang, Changjae Oh, Andrea Cavallaro

TL;DR

This work tackles the problem of reconstructing hand and unseen hand-held object shapes from sparse multi-view RGB input. It introduces SVHO, which represents hand and object shapes as discrete latent cubes learned with a Patchwise VQ-VAE, makes a prediction from each view, and aggregates the view-wise predictions in a canonical space to produce a final reconstruction via marching cubes. Trained entirely on the synthetic ObMan dataset and evaluated on the real DexYCB dataset, SVHO demonstrates that additional sparse views improve hand reconstruction and can benefit object reconstruction under certain conditions, while highlighting the need for segmentation to mitigate background distraction. The approach offers a data-efficient alternative to dense multi-view methods, suitable for rapid adaptation to unseen objects in human–robot interaction scenarios.

Abstract

Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate directly on a real-world recorded hand-object dataset with unseen objects. We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.
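
The idea of combining per-view predictions without optimisation across views can be illustrated with a small sketch. It assumes, as a simplification not spelled out in this summary, that each view outputs a categorical distribution over codebook entries for every cell of the discrete latent cube, and that all views are already expressed in the same canonical frame. The function names and the toy numbers are hypothetical and are not taken from the SVHO implementation.

```python
from typing import List, Sequence


def average_view_predictions(
    view_probs: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average per-view categorical distributions cell by cell.

    view_probs[v][c][k] is the probability that view v assigns to
    codebook entry k at latent cell c (hypothetical layout).
    """
    if not view_probs:
        raise ValueError("at least one view is required")
    num_views = len(view_probs)
    num_cells = len(view_probs[0])
    num_codes = len(view_probs[0][0])
    fused = [[0.0] * num_codes for _ in range(num_cells)]
    for probs in view_probs:
        for c in range(num_cells):
            for k in range(num_codes):
                fused[c][k] += probs[c][k] / num_views
    return fused


def select_codes(fused: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable codebook entry for each latent cell."""
    return [max(range(len(cell)), key=cell.__getitem__) for cell in fused]


# Toy example: 2 views, 2 latent cells, 3 codebook entries.
view_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
view_b = [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]

fused = average_view_predictions([view_a, view_b])
codes = select_codes(fused)
```

Because the fusion is a plain average, adding a view only requires one more forward pass and no joint optimisation. In a full pipeline, the selected codes would then be decoded into a volume from which the surface is extracted with marching cubes; that step is omitted here.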

Paper Structure (18 sections, 3 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Single-view methods suffer from occlusion, while dense multi-view methods require a large amount of collected images. We propose to use sparse multi-view input to improve the reconstruction quality while keeping the data requirements low.
  • Figure 2: We first encode hand and object shapes independently using Patchwise VQ-VAE (P-VQ-VAE). This provides a compact representation to train our hand-object shape prior.
  • Figure 3: Our pipeline for hand-object shape reconstruction from multi-view images. Predicted probabilities from individual views are averaged to get the final prediction.
  • Figure 4: Autoencoder reconstruction of 3D hands and objects.
  • Figure 5: Average F-score and standard deviation across 6 runs for (a) hand and (b) object reconstruction when varying the number of input views.
  • ...and 2 more figures