Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting

Taeho Kang; Youngki Lee

TL;DR

EgoTAP addresses stereo egocentric 3D pose lifting under severe self-occlusion by introducing a Grid ViT Heatmap Encoder that preserves joint-heatmap correspondence and captures inter-joint relations, paired with a skeletal-aware Propagation Network that propagates reliable cues from proximal joints to obscure distal joints. The two components work in concert through a Propagation Unit inspired by LSTM, enabling explicit use of skeletal hierarchy for accurate 3D pose estimation. Across UnrealEgo and EgoCap datasets, EgoTAP achieves state-of-the-art results, with substantial reductions in standard pose errors over prior methods and robust qualitative improvements in occluded scenarios. The approach offers a practical path toward reliable egocentric pose tracking for VR/AR applications, with clear avenues for temporal integration and broader pose-space generalization.
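The idea of passing cues from proximal to distal joints can be illustrated with a toy sketch. This is not the authors' Propagation Unit: the four-joint skeleton, the `propagate` function, and the single scalar sigmoid gate are all assumptions made for illustration, standing in for the learned LSTM-inspired gates the summary mentions.

```python
import math

# Hypothetical toy skeleton (assumption): child joint -> parent joint.
# None marks the root. Parents are listed before their children.
PARENT = {"neck": None, "shoulder": "neck", "elbow": "shoulder", "wrist": "elbow"}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def propagate(features, confidence):
    """Blend each joint's own feature with its parent's propagated state.

    features:   joint name -> list of floats (a per-joint embedding)
    confidence: joint name -> float logit; a high value means the joint's
                own cue is trusted, a low value means it leans on its parent
    """
    state = {}
    for joint, parent in PARENT.items():
        if parent is None:
            # The root has no parent, so it keeps its own feature.
            state[joint] = list(features[joint])
            continue
        gate = sigmoid(confidence[joint])  # scalar stand-in for learned gates
        state[joint] = [
            gate * own + (1.0 - gate) * inherited
            for own, inherited in zip(features[joint], state[parent])
        ]
    return state
```

With a strongly negative confidence for an occluded wrist, its state collapses toward the elbow's propagated state, which is the qualitative behaviour the summary describes for obscure distal joints.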

Abstract

We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address this challenge, prior methods employ joint heatmaps, which are probabilistic 2D representations of the body pose, but heatmap-to-3D pose conversion remains an inaccurate process. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into an effective feature embedding using self-attention. The Propagation Network then estimates the 3D pose, using skeletal information to better estimate the positions of obscure joints. Our method significantly outperforms the previous state of the art both qualitatively and quantitatively, as demonstrated by a 23.9% reduction in MPJPE. Our source code is available on GitHub.
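For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joint positions. The sketch below shows that standard definition together with a relative-reduction helper; the function names and the numbers in the usage note are illustrative assumptions, not values or code from the paper.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between matching 3D joints.

    pred, gt: equal-length sequences of (x, y, z) coordinates.
    The result is in the same unit as the inputs.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

As a usage example with made-up inputs, `mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)])` averages joint distances of 0 and 5, and `relative_reduction(80.0, 60.0)` gives 0.25, which would be reported as a 25% reduction.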

Paper Structure (47 sections, 41 equations, 10 figures, 5 tables)

Introduction
Related Works
Egocentric Pose Estimation
3D Human Pose Estimation with Transformer
Skeletal Network Models
Method
Overview
Grid ViT Heatmap Encoder
Propagation Network
Evaluation
Experiment Setup
Datasets
Baselines
Metrics
Overall Performance
...and 32 more sections

Figures (10)

Figure 1: The stereo egocentric input and a comparison of the poses estimated by the state-of-the-art method [kang2023ego3dpose] and by ours. Blue marks the ground truth and red marks each method's estimate.
Figure 2: The architecture of the common baseline heatmap-to-3D approach, adopted by the monocular xR-EgoPose [tome2019xr] and the stereo UnrealEgo [hakada2022unrealego] for 3D pose inference.
Figure 3: Comparison of heatmaps reconstructed from the encoded heatmap features, using the frozen (c) CNN Encoder and (d) Grid ViT Encoder of the pose estimation model.
Figure 4: Overall network architecture of EgoTAP. EgoTAP takes heatmaps produced by pre-trained heatmap estimators from stereo input images and lifts them to the 3D pose with the Grid ViT Encoder, the Propagation Network, and a final projection layer.
Figure 5: The Propagation Network with two layers of Propagation Units.
...and 5 more figures