Egocentric Visibility-Aware Human Pose Estimation

Peng Dai, Yu Zhang, Yiqiang Feng, Zhen Fan, Yang Zhang

TL;DR

This paper presents Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, with 435K of them annotated with keypoint visibility labels, and proposes EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy.

Abstract

Egocentric human pose estimation (HPE) using a head-mounted device is crucial for various VR and AR applications, but it faces significant challenges due to keypoint invisibility. Nevertheless, none of the existing egocentric HPE datasets provide keypoint visibility annotations, and existing methods often overlook the invisibility problem, treating visible and invisible keypoints indiscriminately during estimation. As a result, their capacity to accurately predict visible keypoints is compromised. In this paper, we first present Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, 435K of which are annotated with keypoint visibility labels. Additionally, we augment the existing EMHI dataset with keypoint visibility annotations to further facilitate research in this direction. Furthermore, we propose EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy. Extensive experiments validate the significant value of ground-truth visibility labels in egocentric HPE settings, and demonstrate that EvaPose achieves state-of-the-art performance on both the Eva-3M and EMHI datasets.

Paper Structure (23 sections, 16 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: We introduce Eva-3M, a large-scale egocentric visibility-aware dataset comprising over 3.0M frames from (a) 31 subjects in daily outfits (b) performing 24 types of daily actions, (c) 435K of which are annotated with keypoint visibility labels. (d) shows the normalized spatial distribution comparison between Eva-3M and EMHI, indicating that Eva-3M covers a wider range of motion than EMHI. (e) illustrates a few representative examples of keypoint invisibility due to self-occlusion and out-of-FoV keypoints in the Eva-3M dataset.
  • Figure 2: Overview of the proposed EvaPose. Given a sequence of egocentric observations, we first propose a visibility-aware 3D pose estimation network to extract stereo image features and predict per-frame 3D keypoints in the camera coordinate system (defined as the left camera coordinate system in this paper), along with their corresponding visibility confidence scores. Then, the predicted 3D keypoints are transformed to the canonical coordinate system using the camera poses from the SLAM system. Next, an iterative intra- and inter-frame attention network is used for temporal feature fusion. Finally, we estimate the 3D poses with the pre-trained VQ-VAE decoder.
  • Figure 3: Qualitative comparisons on both the EMHI (first two rows) and Eva-3M (last two rows) datasets. For better illustration, the predicted (red) and ground-truth (green) 3D poses are re-projected onto external reference views that are not used for pose estimation.
  • Figure 4: Visualization of some representative fitted SMPL meshes.
  • Figure 5: Representative samples of the keypoint visibility labels in our Eva-3M dataset.
  • ...and 1 more figures
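
The Figure 2 caption mentions transforming per-frame 3D keypoints from the (left) camera coordinate system to a canonical coordinate system using camera poses from SLAM. The snippet below is a minimal sketch of that one step only, assuming the camera pose is available as a 3x3 rotation matrix and a translation vector that map camera coordinates to canonical coordinates. The function name `camera_to_canonical`, the pose convention, and the sample values are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def camera_to_canonical(
    keypoints_cam: Sequence[Vec3], rotation: Mat3, translation: Vec3
) -> List[List[float]]:
    """Apply the rigid transform x_canon = R @ x_cam + t to every keypoint.

    Assumption: (rotation, translation) is the camera-to-canonical pose.
    If a tracker reports the inverse pose, it must be inverted first.
    """
    transformed = []
    for point in keypoints_cam:
        transformed.append([
            sum(rotation[row][col] * point[col] for col in range(3))
            + translation[row]
            for row in range(3)
        ])
    return transformed


# Hypothetical pose: 90-degree rotation about the z axis, shifted 1.5 along z.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 1.5]

# Two made-up keypoints expressed in the camera coordinate system.
keypoints = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.5]]
canonical = camera_to_canonical(keypoints, R, t)
print(canonical)  # [[0.0, 1.0, 1.5], [-2.0, 0.0, 2.0]]
```

Expressing every frame in one shared coordinate system is what lets a later temporal stage compare keypoints across frames even though the head-mounted cameras move with the wearer.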