SiMA-Hand: Boosting 3D Hand-Mesh Reconstruction by Single-to-Multi-View Adaptation

Yinqiao Wang; Hao Xu; Pheng-Ann Heng; Chi-Wing Fu

TL;DR

SiMA-Hand tackles occlusion in RGB-based 3D hand-mesh reconstruction by separating view-dependent orientation from view-independent shape and leveraging multi-view information during training. A dual-branch architecture (MVR-Hand and SVR-Hand) enables image-, joint-, and vertex-level feature fusion across views, while single-to-multi-view adaptation distills multi-view knowledge into the single-view reconstructor through hand-shape and hand-orientation feature enhancements. The approach achieves state-of-the-art results on Dex-YCB and HanCo, demonstrating robust performance under heavy occlusion and preserving real-time efficiency. This framework provides a practical pathway to high-quality hand meshes from monocular inputs in realistic, occlusion-heavy scenarios, with code to be released for reproducibility and broader application.

Abstract

Estimating a 3D hand mesh from RGB images is a longstanding task, in which occlusion is one of the most challenging problems. Existing approaches often fail when occlusion dominates the image. In this paper, we propose SiMA-Hand, which aims to boost mesh reconstruction performance through Single-to-Multi-view Adaptation (SiMA). First, we design a multi-view hand reconstructor that fuses information across multiple views by holistically applying feature fusion at the image, joint, and vertex levels. Then, we introduce a single-view hand reconstructor equipped with SiMA. Although it takes only one view as input at inference, its shape and orientation features are enriched by learning non-occluded knowledge from the extra views during training, which improves reconstruction precision in occluded regions. We conduct experiments on the Dex-YCB and HanCo benchmarks with challenging object- and self-caused occlusion cases, showing that SiMA-Hand consistently outperforms the state of the art. Code will be released at https://github.com/JoyboyWang/SiMA-Hand Pytorch.
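The training-time idea in the abstract, where the single-view reconstructor learns from features produced by the multi-view reconstructor, can be illustrated with a small feature-matching loss. This is only a sketch under stated assumptions: the L2 distance, the per-feature weights, and the names `adaptation_loss` and `total_loss` are hypothetical and are not taken from the paper, which may use different loss terms.

```python
from math import sqrt


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adaptation_loss(svr_feats, mvr_feats, weights):
    """Weighted sum of distances between single-view and multi-view features.

    svr_feats, mvr_feats: dicts keyed by feature name ("shape", "orientation").
    weights: hypothetical per-feature loss weights (an assumption of this sketch).
    The multi-view features act as fixed targets; in a deep-learning framework
    they would be excluded from gradient computation.
    """
    return sum(w * l2_distance(svr_feats[k], mvr_feats[k]) for k, w in weights.items())


def total_loss(mesh_loss, svr_feats, mvr_feats, weights):
    """Mesh supervision plus the feature-adaptation term."""
    return mesh_loss + adaptation_loss(svr_feats, mvr_feats, weights)


# Toy example: the shape features differ, the orientation features already match.
svr = {"shape": [0.0, 0.0], "orientation": [1.0, 0.0]}
mvr = {"shape": [3.0, 4.0], "orientation": [1.0, 0.0]}
weights = {"shape": 0.5, "orientation": 1.0}
print(adaptation_loss(svr, mvr, weights))  # 2.5
```

Because the multi-view branch is used only to supply targets during training, a loss of this kind leaves the inference path unchanged: the single-view reconstructor still takes one image as input.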

Paper Structure (33 sections, 12 equations, 5 figures, 6 tables)

Introduction
Related Work
Single-view hand-mesh reconstruction.
Multi-frame hand-mesh reconstruction.
Multi-view hand reconstruction.
Domain alignment.
Method
Overview
Dual-branch structure.
Multi-view Reconstruction
Image-level feature fusion.
Joint-level feature fusion.
Vertex-level feature fusion.
Single-to-Multi-view Adaptation
Hand-shape feature enhancement.
...and 18 more sections

Figures (5)

Figure 1: Framework comparison between our SiMA-Hand and previous methods. We exploit non-occluded information from the multi-view reconstructor (MVR) and carefully adapt it to the single-view reconstructor (SVR) through the proposed single-to-multi-view adaptation techniques. As a result, our SVR learns to extract orientation and shape features that follow those of the MVR and produces higher-quality meshes.
Figure 2: The architecture of SiMA-Hand: (i) the MVR-Hand takes multiple views of the hand as input for 3D hand-mesh reconstruction by fusing multi-view features at the image, joint, and vertex levels; and (ii) the SVR-Hand takes only one view as input and learns to output a high-quality 3D hand mesh, even under severe occlusion (see the target view at lower left), with both shape and orientation feature enhancement from the MVR-Hand. In the mathematical notation, SVR and MVR are denoted by superscripts $S$ and $M$, respectively. Also, $f$ denotes single-view or fused features, $\mathbf{f}$ denotes multi-view features, and subscripts $i$, $j$, $v$, and $o$ denote image, joint, vertex, and orientation, respectively.
Figure 3: The designed modules in SiMA-Hand. (a) The structures of image-, joint-, and vertex-level feature fusion modules; and (b) the architecture of the orientation feature enhancement module.
Figure 4: Qualitative comparison of our method and state-of-the-art 3D hand-mesh reconstruction methods on different datasets. The first and second rows in each example denote the normal view and another view, respectively, for better comparison.
Figure 5: The mesh AUC comparison under different thresholds. Our method consistently outperforms the others.
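For readers unfamiliar with the metric in Figure 5, the sketch below shows one common way to compute an AUC over error thresholds: measure the fraction of mesh vertices whose error falls within each threshold, then integrate that curve and normalize by the threshold range. The 0 to 50 mm range, the number of steps, and the function names are assumptions of this sketch; the paper's exact evaluation protocol is not stated in this section.

```python
def fraction_within(errors_mm, threshold_mm):
    """Fraction of per-vertex errors that are at most the given threshold."""
    return sum(e <= threshold_mm for e in errors_mm) / len(errors_mm)


def mesh_auc(errors_mm, max_threshold_mm=50.0, steps=100):
    """Normalized area under the correct-vertex curve over [0, max_threshold_mm].

    errors_mm: per-vertex Euclidean errors in millimetres.
    max_threshold_mm, steps: assumed evaluation range and resolution.
    Returns a value in [0, 1]; higher is better.
    """
    if not errors_mm:
        raise ValueError("errors_mm must not be empty")
    thresholds = [max_threshold_mm * i / steps for i in range(steps + 1)]
    curve = [fraction_within(errors_mm, t) for t in thresholds]
    # Trapezoidal integration of the curve over the threshold axis.
    area = sum(
        (curve[i] + curve[i + 1]) / 2 * (thresholds[i + 1] - thresholds[i])
        for i in range(steps)
    )
    return area / max_threshold_mm


# Toy comparison: smaller per-vertex errors give a larger AUC.
accurate = [2.0, 4.0, 6.0]
inaccurate = [20.0, 30.0, 40.0]
print(mesh_auc(accurate) > mesh_auc(inaccurate))  # True
```

Plotting the fraction of correct vertices against the threshold gives a curve of the kind compared in Figure 5; the AUC condenses that curve into a single number.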
