Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation

Ruicong Liu, Takehiko Ohkawa, Mingfang Zhang, Yoichi Sato

TL;DR

This work tackles egocentric 3D hand pose estimation in dual-view settings without requiring multi-view annotations or camera parameters. It introduces S2DHand, an unsupervised framework that adapts a pre-trained single-view estimator to arbitrary dual views using two stereo constraints: cross-view consensus, enforced by attention-based merging (ABM), and invariance of the inter-view rotation, enforced by rotation-guided refinement (RGR). A momentum teacher produces predictions that ABM and RGR jointly turn into pseudo-labels for updating the estimator, which enables dual-view inference even when the camera layout is unknown. Experiments on AssemblyHands show significant gains on arbitrary camera pairs in both in-dataset and cross-dataset settings, outperforming existing adaptation methods and making the approach practical for dynamic camera configurations.

Abstract

The pursuit of accurate 3D hand pose estimation stands as a keystone for understanding human activity in the realm of egocentric vision. The majority of existing estimation methods still rely on single-view images as input, leading to potential limitations, e.g., limited field-of-view and ambiguity in depth. To address these problems, adding another camera to better capture the shape of hands is a practical direction. However, existing multi-view hand pose estimation methods suffer from two main drawbacks: 1) Requiring multi-view annotations for training, which are expensive. 2) During testing, the model becomes inapplicable if camera parameters/layout are not the same as those used in training. In this paper, we propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views. Compared with existing multi-view training methods, 1) our adaptation process is unsupervised, eliminating the need for multi-view annotation. 2) Moreover, our method can handle arbitrary dual-view pairs with unknown camera parameters, making the model applicable to diverse camera settings. Specifically, S2DHand is built on certain stereo constraints, including pair-wise cross-view consensus and invariance of transformation between both views. These two stereo constraints are used in a complementary manner to generate pseudo-labels, allowing reliable adaptation. Evaluation results reveal that S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings, and outperforms existing adaptation methods with leading performance. Project page: https://github.com/MickeyLLG/S2DHand.
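
The second stereo constraint, invariance of the transformation between the two views, can be illustrated with a small sketch. Assuming root-relative 3D joints (so the inter-view translation cancels and only a rotation remains), the rotation between two views can be recovered from paired predictions with a least-squares (Kabsch) fit. The function names, the 21-joint layout with the wrist at index 0, and the choice of the Kabsch algorithm are assumptions made for illustration; the abstract does not specify how S2DHand estimates the rotation.

```python
import numpy as np


def estimate_rotation(joints_v1, joints_v2, root=0):
    """Least-squares rotation R with R @ a_i ~= b_i for root-relative joints.

    joints_v1, joints_v2: (N, 3) arrays of the same hand seen from two views.
    root: index of the joint used as origin (assumed to be the wrist).
    """
    a = joints_v1 - joints_v1[root]  # root-relative pose in view 1
    b = joints_v2 - joints_v2[root]  # root-relative pose in view 2
    h = a.T @ b                      # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


def rotation_residual(joints_v1, joints_v2, rotation, root=0):
    """Mean per-joint distance after mapping view 1 into view 2 with `rotation`."""
    a = joints_v1 - joints_v1[root]
    b = joints_v2 - joints_v2[root]
    return float(np.linalg.norm(a @ rotation.T - b, axis=1).mean())
```

If the rotation between the two cameras is fixed, a pair of predictions whose residual under that rotation is large is likely to contain an error in at least one view, which is the kind of signal a rotation-based refinement step could exploit.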

Paper Structure

This paper contains 21 sections, 8 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: From (a) to (b), our single-to-dual-view adaptation method adapts a traditional single-view hand pose estimator to arbitrary dual views. The adapted model becomes more accurate under the dual-view setting. (a) Traditional single-view hand pose estimation. (b) Inference process of the adapted model under a dual-view setting.
  • Figure 2: Problem setting of single-to-dual-view adaptation for hand pose estimation. (a) The input and output of adaptation. (b) The dual-view testing scheme after adaptation.
  • Figure 3: Top: Headset and its camera layout used to collect multi-view data, and samples from the four views. Bottom: Samples of synthetic data. Image samples are from AssemblyHands (Ohkawa et al., 2023), GANerated Hands (Mueller et al., CVPR 2018), and Rendered Handpose (Zimmermann and Brox, 2017), respectively.
  • Figure 4: Overview of the proposed S2DHand. Image pairs captured from arbitrarily placed dual cameras are the input for adaptation. The architecture of S2DHand is illustrated in the dark dashed box, which contains a dynamically updated estimator and a momentum estimator. The momentum estimator's predictions are processed by our pseudo-labeling module (attention-based merging and rotation-guided refinement) to generate pseudo-labels. Using the pseudo-labels, a loss function is computed to update the estimator. The rotation matrix $R$ from the initialization step is required for the pseudo-labeling.
  • Figure 5: Top: Illustration of the first part of pseudo-labeling, the attention-based merging module. The generation of $\hat{y}_{abm}^{v2}$ in view 2 is shown as an example; the process for view 1 is the same. Bottom: Visualizations of heatmaps of differing accuracy.
  • ...and 4 more figures
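
A momentum estimator like the one in Figure 4 is commonly realized as an exponential moving average (EMA) of the dynamically updated estimator's weights. The following is a minimal sketch under that assumption; the function name, the dictionary-of-arrays parameter format, and the momentum value 0.99 are hypothetical and not taken from the paper.

```python
import numpy as np


def momentum_update(teacher, student, momentum=0.99):
    """Return new teacher weights: m * teacher + (1 - m) * student, per parameter.

    teacher, student: dicts mapping parameter names to NumPy arrays of equal shape.
    momentum: assumed smoothing factor; values close to 1 make the teacher change slowly.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }
```

Because the teacher changes slowly, the pseudo-labels it produces are more stable from one step to the next than predictions taken directly from the estimator being trained.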