
EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR

Zhenyu Li, Sai Kumar Dwivedi, Filip Maric, Carlos Chacon, Nadine Bertsch, Filippo Arcadu, Tomas Hodan, Michael Ramamonjisoa, Peter Wonka, Amy Zhao, Robin Kips, Cem Keskin, Anastasia Tkach, Chenhongyi Yang

TL;DR

EgoPoseFormer v2 pairs a transformer-based model for temporally consistent and spatially grounded body pose estimation with an auto-labeling system that enables training on large unlabeled datasets.

Abstract

Egocentric human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. Our model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, and causal temporal attention, and supports both keypoint and parametric body representations under a constant compute budget. The auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. It follows a teacher-student scheme to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. On the EgoBody3M benchmark, with 0.8 ms latency on GPU, our model outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%. In addition, our auto-labeling system improves the wrist MPJPE by a further 13.1%.
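The accuracy figures in the abstract are stated in terms of MPJPE (mean per-joint position error) and relative improvement. As a minimal sketch, assuming the standard definition of MPJPE as the mean Euclidean distance between predicted and ground-truth joints, the two quantities can be computed as follows. The function names and the toy coordinates are illustrative assumptions and are not taken from the paper.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    `pred` and `gt` are equal-length sequences of 3D points (x, y, z).
    This is the standard definition, assumed here rather than quoted
    from the paper.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_improvement(baseline_error, new_error):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy example with two joints (hypothetical values, arbitrary units).
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
error = mpjpe(pred, gt)  # joint errors are 5.0 and 0.0
```

Under this reading, a 13.1% improvement means the new error is 86.9% of the baseline error, whatever the absolute values are.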

Paper Structure (26 sections, 12 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Results on an in-the-wild sequence. Top to bottom: (1) input egocentric images with 2D projections of estimated body joints, (2) outside-in capture, and (3) renderings of estimated body motion. EPFv2 demonstrates accurate and temporally-consistent estimates.
  • Figure 2: Architecture overview (left). We stack two transformer decoders for coarse-to-fine pose estimation. A single holistic query, initialized from auxiliary metadata, attends to multi-view features and historical information to estimate 3D keypoints, pose parameters, and per-joint uncertainty in an end-to-end differentiable architecture. Illustration of the two core attention modules (right). Causal temporal attention enables each frame to attend to its temporal history. Conditioned multi-view cross attention incorporates both view identity and optional 2D keypoint projections of the pose proposal to guide spatial feature aggregation across views.
  • Figure 3: Per-keypoint uncertainty predicted by EPFv2. Larger ellipse extent and higher transparency indicate higher predicted uncertainty. Predictions are shown in green and ground truth (GT) in red.
  • Figure 4: Overview of the mixture training in the auto-labeling system. We adopt a stronger teacher model for pseudo-labeling and apply an uncertainty distillation loss to facilitate knowledge transfer. The teacher model is pre-trained on the labeled dataset $\mathcal{D}_{l}=\{(x_l, y_l)\}$ before this stage.
  • Figure 5: Qualitative results on EgoBody3M. Predictions are shown in green and ground truth in orange.
  • ...and 3 more figures
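
The Figure 2 caption states that causal temporal attention lets each frame attend to its temporal history. A minimal sketch of that general idea, assuming plain single-head scaled dot-product attention over per-frame feature vectors, is given below. It is a generic illustration and not the paper's module; the function name `causal_attention` and the toy features are hypothetical.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list with one feature vector per frame.
    Frame t only attends to frames 0..t, so its output never depends
    on future frames. Generic sketch, not the paper's implementation.
    """
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Causal mask: score only the current frame and its history.
        scores = [
            sum(a * b for a, b in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs


# Three frames of 2D toy features (hypothetical values).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(feats, feats, feats)

# Changing only the last frame leaves the earlier outputs untouched.
modified = feats[:2] + [[9.0, 9.0]]
out_modified = causal_attention(modified, modified, modified)
```

Because the first frame has no history, its output equals its own value vector, and editing a later frame cannot alter any earlier output. That property is what makes this style of attention usable in a streaming setting.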