Sparse multi-view hand-object reconstruction for unseen environments

Yik Lung Pang; Changjae Oh; Andrea Cavallaro

TL;DR

This work reconstructs the shapes of a hand and an unseen hand-held object from sparse multi-view RGB input. It introduces SVHO, which represents hand and object shapes as discrete latent cubes learned by a Patchwise VQ-VAE (P-VQ-VAE), makes a prediction from each view, and aggregates the view-wise predictions in a canonical space to produce the final reconstruction via marching cubes. SVHO is trained entirely on the synthetic ObMan dataset and evaluated on the real DexYCB dataset. The results show that additional sparse views improve hand reconstruction and can benefit object reconstruction under certain conditions, and that segmentation is needed to mitigate background distraction. The approach offers a data-efficient alternative to dense multi-view methods and is suited to rapid adaptation to unseen objects in human-robot interaction scenarios.
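The aggregation step can be illustrated with a minimal sketch. It assumes, hypothetically, that each view yields a probability distribution over codebook indices for every cell of the latent cube, already expressed in the shared canonical space; the function names, the nested-list layout, and the final argmax are illustrative choices, not the authors' implementation.

```python
from typing import List, Sequence


def average_view_probabilities(
    per_view: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average per-view categorical predictions.

    per_view[v][c][k] is the (assumed) probability, predicted from view v,
    that latent cell c takes codebook index k.
    """
    if not per_view:
        raise ValueError("need at least one view")
    n_views = len(per_view)
    n_cells = len(per_view[0])
    n_codes = len(per_view[0][0])
    fused = [[0.0] * n_codes for _ in range(n_cells)]
    for view in per_view:
        for c in range(n_cells):
            for k in range(n_codes):
                fused[c][k] += view[c][k] / n_views
    return fused


def pick_codes(fused: Sequence[Sequence[float]]) -> List[int]:
    """Choose the most probable codebook index for each latent cell."""
    return [max(range(len(p)), key=p.__getitem__) for p in fused]


# Two views, two latent cells, three codebook entries (toy numbers).
view_a = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]
view_b = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
fused = average_view_probabilities([view_a, view_b])
codes = pick_codes(fused)
```

In this toy case view_a alone would pick index 1 for the second cell, while the averaged prediction picks index 2. This is the kind of correction an extra view can provide when one view is occluded.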

Abstract

Recent works in hand-object reconstruction mainly focus on the single-view and dense multi-view settings. On the one hand, single-view methods can leverage learned shape priors to generalise to unseen objects but are prone to inaccuracies due to occlusions. On the other hand, dense multi-view methods are very accurate but cannot easily adapt to unseen objects without further data collection. In contrast, sparse multi-view methods can take advantage of the additional views to tackle occlusion, while keeping the computational cost low compared to dense multi-view methods. In this paper, we consider the problem of hand-object reconstruction with unseen objects in the sparse multi-view setting. Given multiple RGB images of the hand and object captured at the same time, our model SVHO combines the predictions from each view into a unified reconstruction without optimisation across views. We train our model on a synthetic hand-object dataset and evaluate it directly on a real-world hand-object dataset with unseen objects. We show that while reconstruction of unseen hands and objects from RGB is challenging, additional views can help improve the reconstruction quality.

Paper Structure (18 sections, 3 equations, 7 figures, 1 table)

Introduction
Related works
3D shape representations
Single-view hand-object reconstruction
Multi-view hand-object reconstruction
Proposed method
Autoencoding hands and objects
Reconstruction from multi-view images
Experiments
Datasets
ObMan (Hasson et al., 2019)
DexYCB (Chao et al., 2021)
Implementation details
Metrics
Results
...and 3 more sections

Figures (7)

Figure 1: Single-view methods suffer from occlusion, while dense multi-view methods require a large amount of collected images. We propose to use sparse multi-view input to improve the reconstruction quality while keeping the data requirements low.
Figure 2: We first encode hand and object shapes independently using Patchwise VQ-VAE (P-VQ-VAE). This provides a compact representation to train our hand-object shape prior.
Figure 3: Our pipeline for hand-object shape reconstruction from multi-view images. Predicted probabilities from individual views are averaged to get the final prediction.
Figure 4: Autoencoder reconstruction of 3D hands and objects.
Figure 5: Average F-score and standard deviation across 6 runs for (a) hand and (b) object reconstruction when varying the number of input views.
...and 2 more figures
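
The F-score reported in Figure 5 is commonly computed between points sampled from the predicted and ground-truth surfaces. The sketch below uses that common definition (harmonic mean of precision and recall at a distance threshold). The threshold `tau`, the point sampling, and the brute-force nearest-neighbour search are assumptions for illustration; the exact protocol used in the paper is not stated in this summary.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _fraction_within(src: Sequence[Point], dst: Sequence[Point], tau: float) -> float:
    """Fraction of points in src whose nearest neighbour in dst is within tau."""
    if not src or not dst:
        return 0.0
    hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) <= tau)
    return hits / len(src)


def f_score(pred: Sequence[Point], gt: Sequence[Point], tau: float) -> float:
    """Harmonic mean of precision (pred -> gt) and recall (gt -> pred)."""
    precision = _fraction_within(pred, gt, tau)
    recall = _fraction_within(gt, pred, tau)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy point sets: one predicted point lies far from the ground truth.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.05), (1.0, 0.0, 0.0)]
score = f_score(pred, gt, tau=0.1)
```

Here precision is 2/3 (the outlier at x = 5 is penalised) and recall is 1, so the score is 0.8. The brute-force search is quadratic in the number of points; a spatial index would be the usual choice for real meshes.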
