Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting

Taeho Kang, Youngki Lee

TL;DR

EgoTAP addresses stereo egocentric 3D pose lifting under severe self-occlusion by introducing a Grid ViT Heatmap Encoder that preserves joint-heatmap correspondence and captures inter-joint relations, paired with a skeleton-aware Propagation Network that propagates reliable cues from proximal joints to obscure distal joints. The two components work in concert through a Propagation Unit inspired by LSTM, enabling explicit use of the skeletal hierarchy for accurate 3D pose estimation. Across the UnrealEgo and EgoCap datasets, EgoTAP achieves state-of-the-art results, including a 23.9% reduction in MPJPE relative to the previous state of the art, along with qualitative improvements in occluded scenarios. The approach offers a practical path toward reliable egocentric pose tracking for VR/AR applications, with avenues for temporal integration and broader pose-space generalization.

Abstract

We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address this challenge, prior methods employ joint heatmaps (probabilistic 2D representations of the body pose), but heatmap-to-3D pose conversion remains an inaccurate process. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into an effective feature embedding using self-attention. The Propagation Network then estimates the 3D pose, utilizing skeletal information to better estimate the positions of obscure joints. Our method significantly outperforms the previous state of the art both qualitatively and quantitatively, as demonstrated by a 23.9% reduction in error on the MPJPE metric. Our source code is available on GitHub.
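
For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is commonly defined as the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below illustrates that standard definition and how a relative error reduction is computed. It is not taken from the EgoTAP source code; the function names, the two-joint toy pose, and the error values are hypothetical and chosen only for illustration.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: sequences of (x, y, z) joint positions in the same unit
    (for example millimetres) and in the same joint order.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance per joint, averaged over all joints.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical two-joint pose, not data from the paper.
gt_pose = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0)]
pred_pose = [(3.0, 4.0, 0.0), (0.0, 100.0, 12.0)]

error = mpjpe(pred_pose, gt_pose)           # per-joint distances are 5 and 12
reduction = relative_reduction(10.0, 8.5)   # hypothetical baseline and new errors
```

Under this definition, a reported percentage reduction in MPJPE compares the averaged per-joint error of one method against that of a baseline evaluated on the same data.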

Paper Structure (47 sections, 41 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: The stereo egocentric input and a comparison of the poses estimated by the state-of-the-art method [kang2023ego3dpose] and by ours. Blue denotes the ground truth and red denotes the respective method's estimate.
  • Figure 2: The architecture of the common baseline heatmap-to-3D approach. This architecture is adopted by the monocular xR-EgoPose [tome2019xr] and the stereo UnrealEgo [hakada2022unrealego] for 3D pose inference.
  • Figure 3: Comparison of the reconstructed heatmaps from the encoded heatmap features, with the frozen encoder from (c) CNN Encoder and (d) Grid ViT Encoder of the pose estimation model.
  • Figure 4: Overall network architecture of EgoTAP. EgoTAP takes heatmaps from pre-trained heatmap estimators taking stereo input images and lifts the heatmaps to the 3D pose with the Grid ViT Encoder, Propagation Network, and finally, a projection layer.
  • Figure 5: The Propagation Network with two layers of Propagation Unit.
  • ...and 5 more figures