
1st Place Solution of Multiview Egocentric Hand Tracking Challenge ECCV2024

Minqiang Zou, Zhi Lv, Riqiang Jin, Tian Zhan, Mochen Yu, Yao Tang, Jiajun Liang

TL;DR

This report presents a method that estimates hand shape and pose from multi-view input images and camera extrinsic parameters. It also proposes an offline neural smoothing post-processing step that further improves the accuracy of hand position and pose.

Abstract

Multi-view egocentric hand tracking is a challenging task and plays a critical role in VR interaction. In this report, we present a method that uses multi-view input images and camera extrinsic parameters to estimate both hand shape and pose. To reduce overfitting to the camera layout, we apply crop jittering and extrinsic parameter noise augmentation. Additionally, we propose an offline neural smoothing post-processing method to further improve the accuracy of hand position and pose. Our method achieves a mean per-joint position error (MPJPE) of 13.92 mm on the UmeTrack dataset and 21.66 mm on the HOT3D dataset.
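
The two augmentations named in the abstract can be illustrated with a minimal NumPy sketch. The report does not give their exact form here, so everything below is an assumption made for illustration: the function names (`jitter_crop`, `perturb_extrinsics`), the crop parameterization `(cx, cy, size)`, the Gaussian noise model, and the default magnitudes are all hypothetical.

```python
import numpy as np


def jitter_crop(box, rng, max_shift=0.1, max_scale=0.1):
    """Randomly shift and rescale a square crop box (cx, cy, size) in pixels.

    Hypothetical parameterization; the jitter ranges are illustrative only.
    """
    cx, cy, size = box
    # Shift the centre by up to max_shift * size along each axis.
    dx, dy = rng.uniform(-max_shift, max_shift, size=2) * size
    # Rescale the crop by a factor in [1 - max_scale, 1 + max_scale].
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    return cx + dx, cy + dy, size * scale


def rotation_from_axis_angle(rotvec):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)
    kx, ky, kz = rotvec / angle
    K = np.array([[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def perturb_extrinsics(T, rng, rot_std_deg=1.0, trans_std_mm=2.0):
    """Left-multiply a 4x4 extrinsic matrix by a small random rigid transform.

    The Gaussian noise model and its magnitudes are assumptions.
    """
    rotvec = rng.normal(0.0, np.deg2rad(rot_std_deg), size=3)
    noise = np.eye(4)
    noise[:3, :3] = rotation_from_axis_angle(rotvec)
    noise[:3, 3] = rng.normal(0.0, trans_std_mm, size=3)
    return noise @ T
```

During training, such perturbations would be resampled for every sample, for example `perturb_extrinsics(T_cam, np.random.default_rng())`, so that the network never sees exactly the same camera layout or crop twice.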

Paper Structure (4 sections, 2 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Our method is flexible and supports both monocular and multi-view inputs. For simplicity, the figure illustrates the monocular case. The backbone extracts 2D features (feat2d), which are fed into the UV regressor to generate 2D points. In parallel, feat2d is passed through an MLP to produce 3D features (feat3d). We then use the FTL module proposed in UmeTrack (Han et al., 2022) to generate translation-invariant features (feat3d w/o T) for predicting shape, pose, and global orientation, and translation-aware features (feat3d w/ T) for predicting position.
  • Figure 2: To maintain temporal consistency, we treat the model's predictions over a sequence as trainable parameters and optimize them directly. At the same time, we employ a 2D projection loss and an acceleration loss to address the inaccurate positioning caused by poor generalization to varying extrinsic parameters.
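
The offline smoothing idea in Figure 2 can be sketched as a small optimization problem. This is a simplified illustration under stated assumptions, not the authors' implementation: the trainable parameters here are camera-frame 3D joint positions rather than hand model parameters, a single pinhole camera is used, the 2D targets `uv_obs` stand in for the network's 2D predictions, plain gradient descent replaces whatever optimizer the report uses, and the loss weights, step size, and function names are hypothetical.

```python
import numpy as np


def project(X, fx, fy, cx, cy):
    """Pinhole projection of camera-frame points (..., 3) to pixels (..., 2)."""
    u = fx * X[..., 0] / X[..., 2] + cx
    v = fy * X[..., 1] / X[..., 2] + cy
    return np.stack([u, v], axis=-1)


def losses_and_grad(X, uv_obs, intr, w_proj=1.0, w_acc=1.0):
    """Return (projection loss, acceleration loss, gradient of the weighted sum).

    X has shape (T, J, 3): T frames, J joints. Both losses are sums of squares.
    """
    fx, fy, cx, cy = intr
    x, y, z = X[..., 0], X[..., 1], X[..., 2]

    # 2D projection loss: squared pixel residuals against the 2D targets.
    r = project(X, fx, fy, cx, cy) - uv_obs
    proj_loss = np.sum(r ** 2)
    g_proj = np.stack([
        2 * r[..., 0] * fx / z,
        2 * r[..., 1] * fy / z,
        -2 * (r[..., 0] * fx * x + r[..., 1] * fy * y) / z ** 2,
    ], axis=-1)

    # Acceleration loss: squared second finite difference along time.
    acc = X[2:] - 2 * X[1:-1] + X[:-2]
    acc_loss = np.sum(acc ** 2)
    g_acc = np.zeros_like(X)
    g_acc[2:] += 2 * acc
    g_acc[1:-1] -= 4 * acc
    g_acc[:-2] += 2 * acc

    return proj_loss, acc_loss, w_proj * g_proj + w_acc * g_acc


def smooth_sequence(X_init, uv_obs, intr, steps=500, lr=0.01):
    """Refine a whole sequence by gradient descent on the per-frame predictions."""
    X = np.array(X_init, dtype=float)  # copy: the predictions become parameters
    for _ in range(steps):
        _, _, grad = losses_and_grad(X, uv_obs, intr)
        X -= lr * grad
    return X
```

Because the whole sequence is optimized jointly, the acceleration term can draw on both past and future frames, which is what makes this an offline rather than a streaming post-process. The step size must be chosen relative to the loss weights and the camera scale (focal length over depth); the defaults above are tuned only for millimetre-scale toy data.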